QUASI-NEWTON METHODS ON GRASSMANNIANS AND 
MULTILINEAR APPROXIMATIONS OF TENSORS 
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Abstract. In this paper we proposed quasi-Newton and limited memory quasi-Newton methods 
for objective functions defined on Grassmannians or a product of Grassmannians. Specifically we 
defined BFGS and L-BFGS updates in local and global coordinates on Grassmannians or a product 
of these. We proved that, when local coordinates are used, our BFGS updates on Grassmannians 
share the same optimality property as the usual BFGS updates on Euclidean spaces. When applied to 
the best multilinear rank approximation problem for general and symmetric tensors, our approach 
yields fast, robust, and accurate algorithms that exploit the special Grassmannian structure of the 
respective problems, and which work on tensors of large dimensions and arbitrarily high order. 
Extensive numerical experiments are included to substantiate our claims. 
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1. Introduction. 

1.1. Quasi-Newton and limited memory quasi-Newton algorithms on 
Grassmannians. We develop quasi-Newton and limited memory quasi-Newton al- 
gorithms for functions defined on a Grassmannian Gr(n, r) as well as a product of 
Grassmannians Gr(ni,ri) x • • • x Gr(?ifc, r^), with BFGS and L-BFGS updates. These 
are algorithms along the lines of the class of algorithms studied by Edelman, Arias, 
and Smith in [24J and more recently, the monograph of Absil, Mahony, and Sepulchre 
in [5] . They are algorithms that respect the Riemannian metric structure of the man- 
ifolds under consideration, and not mere applications of the usual BFGS and L-BFGS 
algorithms for functions on Euclidean space. The actual computations of our BFGS 
and L-BFGS algorithms on Grassmannians, like the algorithms in [21], require nothing 
more than standard numerical linear algebra routines and can therefore take advan- 
tage of the many high quality softwares developed for matrix computations 3, 10] , In 
other words, manifold operations such as movement along geodesies and parallel trans- 
port of tangent vectors and linear operators do not require actual numerical solutions 
of the differential equations defining these operations; instead they are characterized 
as matrix operations on local and global coordinate representations (as matrices) of 
points on the Grassmannians or points on an appropriate vector bundle. 

A departure and improvement from existing algorithms for manifold optimization 
[UElGHEl! i s that we undertake a local coordinates approach. This allows our compu- 
tational costs to be reduced to the order of the intrinsic dimension of the manifold as 
opposed to the dimension of ambient Euclidean space. For a Grassmannian embedded 
in the Euclidean space of nxr matrices, i.e. Gr(n, r) C R™ xr , computations in local co- 
ordinates have r(n — r) unit cost whereas computations in global coordinates, like the 
ones in [24], have nr unit cost. This difference becomes more pronounced when we deal 
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with products of Grassmannians Gr(ni,ri)x- • • x Gr(rifc, r*fc) C M niXri x- • •xE IlfcXr ' 1 — 
for Hi = O(n) and Ti = O(r), we have a computational unit cost of 0(kr(n — r)) and 
O(krn) flops between the local and global coordinates versions the BFGS algorithms. 
More importantly, we will show that our BFGS update in local coordinates on a prod- 
uct of Grassmannians (and Grassmannian in particular) shares the same well-known 
optimality property of its Euclidean counterpart, namely, it is the best possible up- 
date of the current Hessian approximation that satisfies the secant equations and 



preserves symmetric positive definiteness (cf. Theorem 6.6). For completeness and as 
an alternative, we also provide the global coordinate version of our BFGS and l-bfgs 
algorithms analogous to the algorithms described in |24) . However the aforementioned 
optimality is not possible for BFGS in global coordinates. 

While we have limited our discussions to BFGS and l-bfgs updates, it is straight- 
forward to substitute these updates with other quasi-Newton updates (e.g. dfp or 
more general Broyden class updates) by applying the same principles in this paper. 

1.2. Multilinear approximations of tensors and symmetric tensors. In 

part to illustrate the efficiency of these algorithms, this paper also addresses the 
following two related problems about the multilinear approximations of tensors and 
symmetric tensors, which are also important problems in their own right with various 
applications in analytical chemistry [53], bioinformatics [47], computer vision [55] , 
machine learning [44, 41], neuroscience |45| . quantum chemistry [35], signal processing 
[TD1H31HH] 5 etc. See also the very comprehensive bibliography of the recent survey [55] . 
In data analytic applications, the multilinear approximation of general tensors is the 
basis behind the Tucker model |54| while the multilinear approximation of symmetric 
tensors is used in independent components analysis (141 118j and principal cumulant 
components analysis [44, 41!. The algorithms above provide a natural method to solve 
these problems that exploits their unique structures. 

The first problem is that of finding a best multilinear rank-(p, q, r) approximation 
to a tensor, i.e. approximating a given tensor A £ jj (xmxn by another tensor B £ 
l lxmx " of lower multilinear rank I3T1I201. 



min H-A-SII- 

rank(6)< (p,q,r) 

For concreteness we will assume that the norm in question is the Frobenius or Hilbert- 
Schmidt norm || • \\p. In notations that we will soon define, we seek matrices X, Y, Z 
with orthonormal columns and a tensor C £ M. pxqxr such that 

argmin \\A - (X,Y, Z) ■ C\\ F . (1.1) 

XeO(i,p),yeO(m^),ZeO(n,r),C6R ,xmx " 

The second problem is that of finding a best multilinear rank-r approximation to 
a symmetric tensor S £ S 3 (R"). In other words, we seek a matrix Q whose columns 
are mutually orthonormal, and a symmetric tensor C £ S 3 (K r ) such that a multilinear 
transformation of C by Q approximates S in the sense of minimizing a sum-of-squares 



loss. Using the same notation as in (1.1 1, the problem is 



argmin \\S - (Q,Q,Q) ■ C\\ F . (1.2) 

QSO(n,r),CGS 3 (R r ) 



This problem is significant because many important tensors that arise in applications 
are symmetric tensors. 
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We will often refer to the first problem as the general case and the second problem 
as the symmetric case. Most discussions are presented for the case of 3-tensors for 
notational simplicity but key expressions are given for tensors of arbitrary order to 
facilitate structural analysis of the problem and algorithmic implementation. The 
MATLAB codes of all algorithms in this paper are available for download at [50l [51] . 
All of our implementations will handle 3-tensors and in addition, our implementation 
of the BFGS with scaled identity as initial Hessian approximation will handle tensors 
of arbitrary order. In fact the reader will find an example of a tensor of order-10 in 



Section 11 included to show that our algorithms indeed work on high-order tensors. 

Our approach is summarized as follows. Observe that due to the unitary in- 
variance of the sum-of-squares norm || • ||f> the orthonormal matrices U,V,W in 



(1.1 1 and the orthonormal matrix Q in (1.2) are only determined up to an action 



of O(p) x 0(q) x O(r) and an action of O(r), respectively. We exploit this to our 
advantage by reducing the problems to an optimization problem on a product of 



Grassmannians and a Grassmannian, respectively. Specifically, we reduce (1.1), a 
minimization problem over a product of three Stiefel manifolds and a Euclidean space 
0(1, p) x 0(m,q) x 0(n,r) x W xqxr , to a maximization problem over a product of 



three Grassmannians Gr(l,p) x Gr(m,q) x Gr(n,r); and likewise we reduce (1.2) from 
0(n,r) x S 3 (M'') to a maximization problem over Gr(n, r). This reduction of (1.1) 
to product of Grassmannians has been exploited in [351 GUI |33] • The algorithms in 
|25l 1341 133] involve the Hessian, either explicitly or implicitly via its approximation 
on a tangent. Whichever the case, the reliance on Hessian in these methods results in 
them quickly becoming infeasible as the size of the problem increases. With this in 
mind, we consider the quasi-Newton and limited memory quasi-Newton approaches 
described in the first paragraph of this section. 

An important case not addressed in (35] [3H [33] is the multilinear approximation 



of symmetric tensors (1.2). Note that the general (1.1) and symmetric (1.2) cases are 
related but different, not unlike the way the singular value problem differs from the 
symmetric eigenvalue problem for matrices. The problem ( |1.1| for general tensors 
is linear in the entries of U, V, W (quadratic upon taking norm-squared) whereas 



the problem (1.2) for symmetric tensors is cubic in the entries of Q (sextic upon 



taking norm-squared). To the best of our knowledge, all existing solvers for (1.2) 



are unsatisfactory because they rely on algorithms for (1.1). A typical heuristic is as 
follows: find three orthonormal matrices Q\,Qi, Qz and a nonsymmetric C € jjrxrxr 
that approximates 5, 

5» (QuQzM-C, 

then artificially set Q\ = Q2 = Q3 = Q by either averaging or choosing the last 
iterate and then symmetrize C . This of course is not ideal. Furthermore, using the 
framework developed for the general tensor approximation to solve the symmetric 
tensor approximation problem will be computationally much more expensive. In 
particular, to optimize S ~ (Qi, Q2, Qz)'C without taking the symmetry into account 
incurs a fc-fold increase in computational cost relative to S « (Q, . . . ,Q) ■ C. The 



algorithm proposed in this paper solves (1.2) directly. It finds a single Q e 0(n,r) 
and a symmetric C G S 3 (R' r ) with 

S^(Q,Q,Q)-C. 



The symmetric case can often be more important than the general case, not surprising 
since symmetric tensors are common in practice, arising as higher order derivatives of 
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smooth multivariate real-valued functions, higher order moments and cumulants of a 
vector-valued random variable, etc. 

Like the Grassmann-Newton algorithms in [25 , 34 and the trust-region approach 
in [33 , 32 , the quasi-Newton algorithms proposed in this article have guaranteed con- 
vergence to stationary points and represent an improvement over Gauss-Seidel type 
coordinate-cycling heuristics like alternating least squares (als) , higher-order orthog- 
onal iteration (hooi), or higher-order singular value decomposition (hosvd) [T3]. And 
as far as accuracy is concerned, our algorithms perform as well as the algorithms in 
[231 1521 |M] and outperform Gauss-Seidel type strategies in many cases. As far as 
robustness and speed are concerned, our algorithms work on much larger problems 
and perform vastly faster than Grassmann-Newton algorithm. Asymptotically the 
memory storage requirements of our algorithms are of the same order-of-magnitude 
as Gauss-Seidel type strategies. For large problems, our Grassmann l-bfgs algorithm 
outperforms even Gauss-Seidel strategies (in this case HOOl), which is not unexpected 
since it has the advantage of requiring only a small number of prior iterates. 

We will give the reader a rough idea of the performance of our algorithms. Using 
matlab on a laptop computer, we attempted to find a solution to an accuracy within 
machine precision, i.e. w 10~ 13 . For general 3-tensors of size 200 x 200 x 200, our 
Grassmann l-bfgs algorithm took less than 13 minutes while for general 4-tensors 
of size 50 x 50 x 50 x 50, it took about 7 minutes. For symmetric 3-tensors of size 
200 x 200 x 200, our Grassmann BFGS algorithm took about 5 minutes while for 
symmetric 4-tensors of size 50 x 50 x 50 x 50, it took less than 2 minutes. In all 
cases, we seek a rank-(5, 5, 5) or rank-(5, 5, 5, 5) approximation. For a general order- 
10 tensor of dimensions 5 x ■ • • x 5, a rank-(2, . . . , 2) approximation took about 15 
minutes to reach the same accuracy as above. More extensive numerical experiments 



are reported in Section 11 The reader is welcomed to try our algorithms, which have 
been made publicly available at [50l [51] . 

1.3. Outline. The structure of the article is as follows. In Sections [2] and [3] 
we present a more careful discussion of tensors, symmetric tensors, multilinear rank, 
and their corresponding multilinear approximation problems. In Section [4] we will 
discuss how quasi-Newton methods in Euclidean space may be extended to Rieman- 
nian manifolds and, in particular, Grassmannians. Section [5] contains a discussion on 
geodesic curves and transport of vectors on Grassmannians. In Section [6] we present 
the modifications on quasi-Newton methods with BFGS updates in order for them to 
be well-defined on Grassmannians. Also, the reader will find proof of the optimal- 
ity properties of BFGS updates on products of Grassmannians. Section J7] gives the 
corresponding modifications for limited memory BFGS updates. Section [8]states the 
corresponding expressions for the tensor approximation problem, which are defined 
on a product of Grassmannians. The symmetric case is detailed in section [9] Sec- 
tion [TO] contains a few examples with numerical calculations illustrating the presented 
concepts. The implementation and the experimental results are found in Section [TTj 
Related work and the conclusions are discussed in Section [12] and [13] respectively. 

1.4. Notations. Tensors will be denoted by calligraphic letters, e.g. A, B, C. 
Matrices will be denoted in upper case letters, e.g. X, Y, Z. We will also use upper- 
case letters X, Xk to denote iterates or elements of a Grassmannian since we represent 
them as (equivalence classes of) matrices with orthonormal columns. Vectors and 
iterates in vector form are denoted with lower case letters, e.g. x, y, Xk, J/fc, where the 
subscript is the iteration index. To denote scalars we use lower case Greek letters, 
e.g. a, j3, and t, t k . 
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We will use the usual symbol ® to denote the outer product of tensors and a large 
boldfaced version (g> to denote the Kronecker product of operators. For example, if 
A G M mxn and B G M px? are matrices, then A ® B will be a 4-tensor in m™ x ™ x px<3 
whereas A® B will be a matrix in jj m P x ™9. i n the former case, we regard A and B as 
2-tensors while in the latter case, we regard them as matrix representations of linear 
operators. The contracted products used in this paper are defined in Appendix |A} 

For r < n, we will let 0(n,r) = {X G R nXr \ X T X = 1} denote the Stiefel 
manifold of n x r matrices with orthonormal columns. The special case r = n, i.e. 
the orthogonal group, will be denoted O(n). For r < n, O(r) acts on 0(n,r) via 
right multiplication. The set of orbit classes 0(n, r)/ O(r) is a manifold called the 
Grassmann manifold or Grassmannian (we adopt the latter name throughout this 
article) and will be denoted Gr(n, r). 

In this paper, we will only cover a minimal number of notions and notations 
required to describe our algorithm. Further mathematical details concerning tensors, 
tensor ranks, tensor approximations, as well as the counterpart for symmetric tensors 
may be found in [8, 20j [25]. Specifically we will use the notational and analytical 
framework for tensor manipulations introduced in [25l Section 2] and assume that 
these concepts are familiar to the reader. 

2. General and symmetric tensors. Let V\,.. . ,Vk be real vector spaces of 
dimensions ni , . . . , respectively and let A be an element of the tensor product 
V\ <g> . . . <g> Vk, i.e. A is a tensor of order k [301 [Ml 033 ■ Up to a choice of bases 
on V\, . . . ,Vk, one may represent a tensor A as a fc-dimensional hypermatrix A = 
ijj G K™i x ' x "fc. Similarly, let V be a real vector space of dimension n and 
S G S k (V) be a symmetric tensor of order k |3Ql [39l [56| . Up to a choice of basis on V, 
S may be represented as a fc-dimensional hypermatrix S = [s il ... ik ] G ]R nx " xn whose 
entries are invariant under any permutation of indices, i.e. 



5l <r(l)"-Mfc) 



for every a G &k (2.1) 



We will write S (K n ) for the subspace of E nx ' " x " satisfying (2.1). Henceforth, we 
will assume that there are some predetermined bases and will not distinguish between 
a tensor A G V\ ® . . . ® V% and its hypermatrix representation A G M™ 1 *"' Xflk and 
likewise for a symmetric tensor S G S k (V) and its hypermatrix representation S G 
S fe (IR"). Furthermore we will sometimes present our discussions for the case k = 3 
for notational simplicity. We often call an order-fc tensor simply as a fc-tensor and an 
order-A: symmetric tensor as a symmetric fc-tensor. 

As we have mentioned in Section[T] symmetric tensors are common in applications, 
largely because of the two examples below. The use of higher-order statistics in signal 
processing and neuroscience, most notably the technique of independent component 
analysis, symmetric tensors often play a central role. The reader is referred to [S] for 
further discussion of symmetric tensors. 

Example 2.1. Let m G {1,2, ... ,00} and Vl C K™ be an open subset. If f G 
C m {VL), then for k = 1, . . . , m, the kth derivative of f at a G Q is a symmetric tensor 
of order fc, 



Qkf 

' "(a) 



G S fe (K n ). 

iiH \-i n =k 



_dx\ 1 ...dxlr 

For fc = 1,2, the vector D 1 /(a) and the matrix D 2 /(a) are the gradient and Hessian 
of f at a, respectively. 

Example 2.2. Let X\, . . . , X n be random variables with respect to the same 
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probability distribution \x. The moments and cumulants of the random vector X = 
(Xi, . . . , X n ) are symmetric tensors of order k defined by 



m k (X) = [E(x h x l2 ■ ■ ■ Xi k )] n iu „. 4k=l 



ii,...,i fc =l 



Kfc(X) 



={*!, — ,**} 



respectively. The sum above is taken over all possible partitions . . . , i^} = A± U 
• • • U Ap. It is not /iarrf to s/iow i/iai both mfe(X) arid Kfe(X) € S fc (R™). for n = 1, the 
quantities ftfc(X) /or fc = 1,2,3,4 /lave well-known names: they are the expectation, 
variance, skewness, and kurtosis of the random variable X , respectively. 

3. Multilinear transformation and multilinear rank. Matrices can act on 
other matrices through two independent multiplication operations: left-multiplication 
and right-multiplication. If C £ R pxq and X £ R mxp , Y £ R nxq , then the matrix C 
may be transformed into the matrix A £ K mx ™, by 



A = XCY T , 



^2 z2 ZiotVjpCap ■ 

a=l fl=l 



(3.1) 



Matrices act on order-3 tensors via three different multiplication operations. As in 
the matrix case, these can be combined into a single formula. If C £ R pxqxr and 
X £ R lxp , Y £ R mx i, Z £R nxr , then the 3-tensor C may be transformed into the 
3-tensor A £ 



pZxmxn 



p q r 
a=l 0=1 7=1 



(3.2) 



We call this operation the trilinear multiplication of A by matrices X, Y and Z, which 
we write succinctly as 



A= (X,Y,Z) -C. 



(3.3) 



This is nothing more than the trilinear equivalent of (3.1), which in this notation has 
the form 

A = XCY T = (X, Y) ■ C. 



Informally, (3.2 1 amounts to multiplying the 3-tensor A on its three 'sides' or modes 
by the matrices X, Y , and Z respectively. 



An alternative but equivalent way of writing (3.2 1 is as follows. Define the outer 
product of vectors x £ R l , y £ R m , z £ R n by 



x ®y ® z = [xtyjZk} £ 
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p/xmxn 



and call a tensor of the form x®y®z a decomposable tensor or, if non-zero, a rank-1 
tensor. One may also view (3.2) as a trilinear combination (as opposed to a linear 
combination) of the decomposable tensors given by 



A = ^^^c Q( 3 7 x Q (g)j/ / 3(g)z 7 . (3.4) 

a=l p=l 7=1 

The vectors xi,...,x p G R l , yi,...,y q G l m , Zi,...,z r G M" are, of course, the 
column vectors of the respective matrices X, Y, Z above. In this article, we find it 



more natural to present our algorithms in the form (3.3) and therefore we refrain from 



using (3.4) 



More abstractly, given linear transformations of real vector spaces T u : U —t U', 
T v : V -> V, T w ; W W, the functoriality of tensor product HUGS] (denoted by 
the usual notation (g> below) implies that one has an induced linear transformation 
between the tensor product of the respective vector spaces 

T u ® T v ® T w : U ® V ® W -> U' ® V ® W. 

The trilinear matrix multiplication above is a coordinatized version of this abstract 
transformation. In the special case where U = U' , V = V , W = W and T u , T v , 



T w are (invertible) change-of-basis transformations, the operation in (3.2) describes 
the manner a (contravariant) 3-tensor transforms under change-of-coordinates of U, 
V and W. 



Associated with the trilinear matrix multiplication (3.3) is the following notion 
of tensor rank that generalizes the row-rank and column-rank of a matrix. Let A G 
R lxmx ". For fixed values of j G {1, . . . , m} and k G {1, . . . , n}, consider the column 
vector, written in a MATLAB-like notation, A(:,j, k) G M. 1 . Likewise we may consider 
A(i, :, k) G W 71 for fixed values of i, k, and A(i, j, :) G K™ for fixed values of i, j. Define 



n(A) 

T2{A) 

rz(A) 



dim(span{yl(:, j, k) | 1 < j < m, 1 < k < n}), 
dim(span{^(i, :, k) | 1 < i < 1, 1 < k < n}), 
dim(span{„4(i, j, :) | 1 < i < 1, 1 < j < to}). 



Note that M. lxmxn ma y also be viewed as U. lxmn . Then ri(A) is simply the rank of A 
regarded as an I x mn matrix, see the discussion on tensor matricization in |25j with 
similar interpretations for r^{A) and r^{A). The multilinear rank of A is the 3-tuple 
(ri(A),r2(A),r 3 (A)) and we write 

rank(^) = (r 1 (A) > r 2 (A),r 3 (A)). 

We need to 'store' all three numbers as r\(A) ^ r 2 (A) ^ r 3 (A) in general — a clear 
departure from the case of matrices where row-rank and column-rank are always 
equal. Note that a rank-1 tensor must necessarily have multilinear rank (1,1,1), 
i.e. rank(x ® y ® z) = (1, 1, 1) if x, y, z are non-zero vectors. 

For symmetric tensors, one would be interested in transformation that preserves 
the symmetry. For a symmetric matrix C G S 2 (M n ), this would be 

r r 

S = XCX T , Sjj = } } XigXjffCgff . (3.5) 

a=l 0=1 
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Matrices act on symmetric order-3 tensors S £ S 3 (K ra ) via the symmetric version 
of pTf I 



r r r 

$ijk — ^ ^ ^ Xi a X j pXk^C a f3^ ■ (*^*^) 
a = l /3=1 7=1 

We call this operation the symmetric trilinear multiplication of S by matrix X, which 
in the notation above, is written as 



S= (X,X,X)-C. 



This is nothing more than the cubic equivalent of ( 3.5 ) , which in this notation becomes 

S = XCX T = (X,X)-C. 



Informally, (3.6 1 amounts to multiplying the 3-tensor A on its three 'sides' or modes 
by the same matrix X. In the multilinear combination form, this is 



S = c Q( 3 7 x Q ® xp ® x 1 

a=l /3=1 7=1 



where the vectors x\, . . . , x r £ 1™ are the column vectors of the matrix X above. 

More abstractly, given a linear transformation of real vector spaces T : V —> V , 
the functoriality of symmetric tensor product |30l 139] implies that one has an induced 
linear transformation between the tensor product of the respective vector spaces 

S fe (T) : S k (V) -> S k (V). 



The symmetric trilinear matrix multiplication above is a coordinatized version of this 
abstract transformation. In the special case where V = V and T is an (invertible) 
change-of-basis transformation, the operation in (3.6) describes the manner a sym- 
metric 3-tensor transforms under changc-of-coordinates of V . 
For a symmetric tensor S £ S 3 (K n ), we must have 



n(S) = r 2 (S) = r 3 (S) 

by its symmetry. See Lemma |9.1| for a short proof. The symmetric multilinear rank of 
iS is the common value, denoted rs{S). When referring to a symmetric tensor in this 
article, rank would always mean symmetric multilinear rank; e.g. a rank-s symmetric 
tensor S would be one with rs(<S) = s. 

We note that there is a different notion of tensor rank and symmetric tensor rank, 
defined as the number of terms in a minimal decomposition of a tensor (resp. sym- 
metric tensor) into rank-1 tensors (resp. rank-1 symmetric tensors). Associated with 
this notion of rank are low-rank approximation problems for tensors and symmetric 
tensors analogous to the ones discussed in this paper. Unfortunately, these are ill- 
posed problems that may not even have a solution [8l I20j . As such, we will not discuss 
this other notion of tensor rank. It is implicitly assumed that whenever we discuss 
tensor rank or symmetric tensor rank, it is with the multilinear rank or symmetric 
multilinear rank defined above in mind. 



8 



3.1. Multilinear approximation as maximization over Grassmannians. 

Let _Ae]R /xmx ™bea given third order tensor and consider the problem 

min{||^-S|| F | rank(B) < (p,q,r)}. 
Under this rank constraint, we can write B in factorized form 

pgr 

B = (X,Y,Z)-C, bjjk XiayjpZk-yCgfr, (3.7) 

a=l P=l 7=1 

where C e W x i xr and X € M ixp , Y G R rox «, Z e M" xr , are full rank matrices. In 
other words, one would like to solve the following best multilinear rank approximation 
problem, 

mm \\A-(X,Y,Z).C\\ F . (3.8) 

This is the optimization problem underlying the Tucker model 54 that originated in 
psychometrics but has become increasingly popular in other areas of data analysis. 
In fact, there is no loss of generality if we assume X T X = I, Y T Y = 1 and Z J Z = I. 
This is verified as follows, for any full column-rank matrices X, Y and Z we can 
compute their QR-factorizations 

X = XR X , Y = YR Y , Z = ZR z 

and multiply the right triangular matrices into the core tensor, i.e. 

\\A-(X,Y,Z)-C\\ F = \\A-(X,Y,Z)-C\\ F where C = {R x , Ry, Rz) • C . 

With the orthonormal constraints on X ,Y and Z the tensor approximation problem 
can be viewed as an optimization problem on a product of Stiefel manifolds. Using 
the identity 

(C/ T , U T , W T ) -A = A-(U,V, W), 

we can rewrite the tensor approximation problem as a maximization problem with 
the objective function 

$(X,Y,Z) = ~\\A-(X,Y,Z)\\ 2 F s.t. X T X = I 1 Y T Y = I. Z T Z = I, (3.9) 

in which the small core tensor C is no longer present. See references [16l [25] for the 
elimination of C. The objective function $(X, Y, Z) is invariant under orthogonal 
transformation of the variable matrices X, Y, Z from the right. Specifically, for any 
orthogonal matrices Q\ G 0(p), Q 2 G O(q) and Q 3 G O(r) it holds that Y, Z) = 
$(XQi, YQ2, ZQ3). This homogeneity property implies that Q(X, Y, Z) is in fact 
defined on a product of three Grassmannians Gr(/,p) x Gr(m, q) x Gr(n,r). 
Let S G S 3 (M™) be a given symmetric 3-tensor and consider the problem 

min{||5 - T\\f I T G S 3 (E"), r s (S) < r}. 

Under this rank constraint, we can write T in factorized form 

r r r 

T = (X, X, X) ■ C, tijk = ^ ^ ^ XiaXjpXkjCapj, 

a=l j3=l 7=1 
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where C £ S 3 (K r ) and X £ W lXr is a full-rank matrix. In other words, one would like 
to solve the following best symmetric multilinear rank approximation problem, 

mm{\\A-(X,X,X)-C\\ F \ X £ 0(n,r), C £ S 3 (M r )}. 

As with the general case, there is no loss of generality if we assume X T X = I. With 
the orthonormal constraints on X, the tensor approximation problem can be viewed 
as an optimization problem on a single Stiefel manifold (as opposed to a product of 



Stiefel manifolds in (3.9 1). Using the identity 

(V T ,V T 1 V T )-S = S-(V,V,V), 

we may again rewrite the tensor approximation problem as a maximization problem 
with the objective function 

*(X) = ±\\S>(X,X,X)f F s.t. X T X = L 

in which the core-tensor C is no longer present. As with the general case, the ob- 
jective function $(X) also has an invariance property, namely &(X) — $>(XQ) for 
any orthogonal Q £ O(r). As before, this homogeneity property implies that $(X) is 
well-defined on a single Grassmannian Gr(n, r). 

These multilinear approximation problems may be viewed as 'dimension reduc- 
tion' or 'rank reduction' for tensors and symmetric tensors respectively. In general, a 
matrix requires 0(n 2 ) storage and an order-A; tensor requires 0(n k ) storage. While 
it is sometimes important to perform dimension reduction to a matrix, a dimen- 
sion reduction is almost always necessary if one wants to work effectively with a 
tensor of higher order. A dimension reduction of a matrix A £ W ixn of the form 
A w UCV T , where U, V £ 0(n, r) and diagonal C £ W xr reduces dimension from 
0(n 2 ) to 0(nr + r). A dimension reduction of a tensor A £ R nx '" xn of the form 
A~ (Qi, . . . , Qk) -C where Qi, ■ ■ ■ ,Qk £ 0(n, r) and C £ M rx '" xr , reduces dimension 
from 0(n k ) to 0(knr + r k ). If r is significantly smaller than n, e.g. r = 0(n 1 / fc ), then 
a dimension reduction in the higher order case could reduce problem size by orders of 
magnitude. 

4. Optimization in Euclidean space and on Riemannian manifolds. In 

this section we discuss the necessary modifications for generalizing an optimization 
algorithm from Euclidean space to Riemannian manifolds. Specifically we consider 
the quasi-Newton methods with BFGS and limited memory BFGS (l-BFGS) updates. 
First we state the expressions in Euclidean space and then we point out what needs 
to be modified. The convergence properties of quasi-Newton methods defined on 
manifolds were established by Gabay [37] . Numerical treatment of algorithms on the 
Grassmannian are given in [2H [H 02 A recent book on optimization on 

manifolds is [2]. In this and the next three sections, i.e. Sections [4] through [7j we will 
discuss our algorithms in the context of minimization problems, as is conventional. 
It will of course be trivial to modify them for maximization problem. Indeed the 
tensor approximation problems discussed in Sections [8] through [TT] will all be solved 
as maximization problems on Grassmannians or product of Grassmannians. 

4.1. BFGS updates in Euclidean Space. Assume that we want to minimize a 
nonlinear real valued function f(x) where x £ R n . As is well-known, in quasi-Newton 
methods, one solves 



H k Pk = ~9k, 
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(4.1) 



to obtain the direction of descent p k from the current iterate x k and the gradient 
9k = V/(xfc) at x k . Unlike Newton method, which uses the exact Hessian for H k , in 



(4.1 ) H k is only an approximation of the Hessian at x k . After computing the (search) 
direction p k one obtains the next iterate as Xk+i = x k + tkPk in which the step length 
tk is usually given by a line search method satisfying the Wolfe or the Goldstein 
conditions |46j . Instead of recomputing the Hessian at each new iterate Xk+i, it is 
updated from the previous approximation. The BFGS update has the following form, 

H k s k s T k H k y k yT 

H k+1 - H k j— h —- p — , (,4.2) 

s k H k s k y k s k 

where 

s k = x k+ i -x k = t k Pk, (4.3) 
y k =9k+\-9k- (4.4) 

Quasi-Newton methods with BFGS updates are considered to be the most compu- 
tationally efficient algorithms for minimization of general nonlinear functions. This 
efficiency is obtained by computing a new Hessian approximation as a rank-2 modifi- 
cation of the previous Hessian. The convergence of quasi-Newton methods is super- 
linear in a vicinity of a local minimum. In most cases the quadratic convergence of 
the Newton method is outperformed by quasi-Newton methods since each iteration is 
computationally much cheaper than a Newton iteration. A thorough study of quasi- 
Newton methods may be found in [46 , 23 . The reader is reminded that quasi-Newton 
methods do not necessarily converge to local minima but only to stationary points. 



Nevertheless, using the Hessians that we derived in Sections 8.1 and 9.3 (for the gen- 
eral and symmetric cases respectively), the nature of these stationary points can often 
be determined. 

4.2. Quasi-Newton methods on a Riemannian manifold. We will give a 
very brief overview of Riemannian geometry tailored specifically to our needs in this 
paper. First we sketch the modifications that are needed in order for an optimization 
algorithm to be well-defined when the objective function is defined on a manifold. 
For details and proof, the reader should refer to standard literature on differential 
and Riemannian geometry [TTJ [S] [2] • Informally a manifold, denoted with M, is an 
object locally homeomorphic to M. n . We will regard M as a submanifold of some 
high-dimensional ambient Euclidean space R . Our objective function / : M — > K 
will be assumed to have (at least) continuous second order partial derivatives. We 
will write f(x), g{x), and H(x) for the value of the function, the gradient, and the 
Hessian at x G M. These will be reviewed in greater detail in the following. 



Equations (4.1)-(4.4) are the basis of any algorithmic implementation involving 
BFGS or l-bfgs updates. The key operations are (1) computation of the gradient, 
(2) computation of the Hessian or its approximation, (3) subtraction of iterates, e.g. 
to get s k or x k+ i and (4) subtraction of gradients. Each of these points needs to be 
modified in order for these operations to be well-defined on manifolds. 

Computation of the gradient. The gradient g(x) — gradf(x) — V/(x) at x G M 
of a real-valued function defined on a manifold, / : M — > R, M 9 x i— > f(x) G R, is 
a vector in the tangent space T x = T X (M) of the manifold at the given point x. We 
write g G T x . 

To facilitate computations, we will often embed our manifold M in some am- 
bient Euclidean space and in turn endow M with a system of global coordi- 
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nates (x\, . . . ,xn) where N is usually larger than d := dim(M), the intrinsic dimen- 
sion of M. The function / then inherits an expression in terms of x\,... ,x^, say, 
f(x) = f(x\, . . . , Xn). We would like to caution our readers that computing V/ on M 
is not a matter of simply taking partial derivatives of / with respect to X\, . . . , x^. An 
easy way to observe this is that V/ is a d-tuple when expressed in local coordinates 
whereas (df/dxi, . . . ,df /dxj^j is an TV-tuple. 

Computation of the Hessian or its approximation. The Hessian H(x) — Hess f(x) = 
W 2 f(x) at x E M of a function / : M —¥ R is a linear transformation of the tangent 
space T x to itself, i.e. 

H(x) : T, -»■ T, . (4.5) 

As in the case of gradient, when / is expressed in terms of global coordinates, differ- 
entiating the expression twice will in general not give the correct Hessian. 

Updating the current iterate. Given an iterate Xk € M, a step length t^ and a 
search direction pk G T Xk the update Xk+i — x^ + tkVk will in general not be a point 
on the manifold. The corresponding operation on a manifold is to move along the 
geodesic curve of the manifold given by the direction pk- Geodesies on manifolds 
correspond to straight lines in Euclidean spaces. The operation = Xk+i — Xk 
is undefined in general when the points on the right hand side belong to a general 
manifold. 

Updating vectors and operators. The quantities Sk and yk are in fact tangent 
vectors and the Hessians (or Hessian approximations) are linear operators all defined 
at a specific point of the manifold. A given Hessian H(x) is only defined at a point 
x G M and correspondingly only acts on vectors in the tangent space T x . In the 



right hand side of the BFGS update (4.2 ) all quantities need to be defined at the same 
point Xk in order to have well-defined operations between the terms. In addition the 
resulting sum will define an operator at a new point Xk+i- The notion of parallel 
transporting vectors along geodesic curves resolves all of these issues. The operations 
to the Hessian are similar and involves parallel transport operation back and forth 
between two different points. 

5. Grassmann geodesies and parallel transport of vectors. In this paper, 
the Ricmannian manifold M of most interest to us is Gr(n, r), the Grassmannian or 
Grassmannian of r-planes in R n . Our discussion will proceed with M = Gr(n,r). 

5.1. Algorithmic considerations. Representing points on Grassmannians as 
(equivalence classes of) matrices allows us to take advantage of matrix arithmetic and 
matrix algorithms, as well as the readily available libraries of highly optimized and 
robust matrix computational softwares developed over the last five decades [31 00] ■ A 
major observation of |24j is the realization that common differential geometric oper- 
ations on points of Grassmann and Stiefel manifolds can all be represented in terms 
of matrix operations. For our purposes, the two most important operations are (1) 
the determination of a geodesic at a point along a tangent vector and (2) the parallel 
transport of a tangent vector along a geodesic. On a product of Grassmannians, these 
operations may likewise be represented in terms of matrix operations [25]. We will 
give explicit expressions for geodesic curves on Grassmannians and two different ways 
of parallel transporting tangent vectors. 

5.2. Grassmannians in terms of matrices. First we will review some pre- 
liminary materials from |24j . A point (X) on the Grassmannian Gr(n,r) is an equiv- 
alence class of orthonormal matrices whose columns form an orthonormal basis for an 
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r-dimcnsional subspace of W 1 . Explicitly, we write 



(X) = {XQeO(n,r)\QeO(r)}, 

where X £ 0(n, r), i.e. X is an n x r matrix and X T X = I r . The set 0(n, r) = {X £ 
R nxr | X 1 X = I r } is also a manifold, often called the Stiefel manifold. When n = r, 
0(r, r) — O(r) is the orthogonal group. It is easy to see that the dimensions of these 
manifolds are 

dim(0(r)) = ^r(r — 1), dim(0(n, r)) = nr — ^r(r + 1), dim(Gr(n, r)) = r(n — r). 

In order to use standard linear algebra in our computations, we will not be able to 
work with a whole equivalence class of matrices. So by a point on a Grassmannian, we 
will always mean some X £ (X) that represents the equivalence class. The functions / 
that we optimize in this paper will take orthonormal matrices in 0(n, r) as arguments 
but will always be well-defined on Grassmannians: given two representatives Xi 7 X 2 £ 
(X), we will have f{X{) = f(X 2 ), i.e. 

f(XQ) = f(X) for every Q £ O(r). 

Abusing notations slightly, we will sometimes write / : Gr(n, r) — > K. 

The tangent space Tx , where X £ Gr(n, r), is an affine vector space with elements 
in R nxr . It can be shown that any element A £ T x satisfies 

X T A = 0. 



The projection on the tangent space is 

Tlx = I-XX T = X ± Xl, 

where X± is an orthogonal complement of X, i.e. the square matrix [X X±] is an n x n 
orthogonal matrix. Since by definition X 1 ~Xj_ = 0, any tangent vector can also be 
written as A = X±D where D is an (n — r) x r matrix. This shows that the columns 
of X± may be interpreted as a basis for Tx- We say that A is a global coordinate 
representation and D is a local coordinate representation of the same tangent. Note 
that the number of degrees of freedom in D equals the dimension of the tangent 
space Tx , which is r(n — r). It follows that for a given tangent in global coordinates 
A, its local coordinate representation is given by D — Xj_A. Observe that to a 
given local representation D of a tangent there is an associated basis matrix X±. 
Tangent vectors are also embedded in K nr since in global coordinates they are given 
bynxr matrices. We will define algorithms using both global coordinates as well 
as intrinsic local coordinates. When using global coordinates, the Grassmannian 
Gr(n,r) is (isometrically) embedded in the Euclidean space M nxr and a product of 
Grassmannians in a corresponding product of Euclidean spaces. The use of Pliickcr 
coordinates to represent points on Grassmannian is not useful for our purpose. 

5.3. Geodesies. Let X £ Gr(n, r) and A be a tangent vector at X, i.e. A £ Tx- 

The geodesic path from X in the direction A is given by 



X(t) = [XV U] 
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cos Tit 
sin St 



V\ (5.1) 



where A = UY^V T is the thin SVD and we identify X(0) = X. Observe that omitting 
the last V in (5.1| will give the same path on the manifold but with a different^ 
representation. This information is useful because some algorithms require a consis- 
tency in the matrix representations along a path but other algorithms do not. For 
example, in a Newton-Grassmann algorithm we may omit the second V [25] but in 
quasi-Newton-Grassmann algorithms V is necessary. 

5.4. Parallel transport in global and local coordinates. Let X be a point 
on a Grassmannian and consider the geodesic given by the tangent vector A £ Tj. 
The matrix expression for the parallel transport of an arbitrary tangent vector A2 £ 
Tx is given by 



Tx(t) ^ A 2 (i) = [XV U] 



— sin Y,t 
cos Si 



U T + (I-UU J )^A 2 = T x . A (t)A 2 , (5.2) 



where A = UY,V T is the thin SVD and we define T x ^{t) to be the parallel transport 
matriJ^ from the point X in the direction A. If A2 = A expression (5.2) can be 
simplified. 

Let X £ Gr(n, r), A <E Tx, and X± be an orthogonal complement of X so that 
[X X±] is orthogonal. Recall that we may write A = X±D, where we view X± as 
a basis for Tx and D as a local coordinate representation of the tangent vector A. 



Assuming that X(t) is the geodesic curve given in (5.1 1, the parallel transport of the 
corresponding basis X±(t) for T X (t) is given by 



X±(t)=T XA (t)X ± , 



(5.3) 



where Tx,A(t) is the transport matrix defined in (5.2). It is straightforward to show 



that the matrix [X(t) X±(t)] is orthogonal for all t, i.e. 

Xl(t)Xj_(t) = I n _ r and Xl(t)X(t) = for every t. 



Using (5.3) we can write the parallel transport of a tangent vector A 2 as 



A 2 (i) = T{t)A 2 = T(t)X x D 2 = X ± {t)D 2 



(5.4) 



Equation (5.4) shows that the local coordinate representation of the tangent vector is 
constant at all points of the geodesic path X(t) when the basis for Txit) is given by 
X±(t). The global coordinate representation, on the other hand, varies with t. This is 
an important observation since explicit parallel transport of tangents and Hessians (cf. 



Section 6.5) can be avoided if the algorithm is implemented using local coordinates. 
The computational complexity for these two operation^ are 0(n 2 r) and 0(n 3 r 2 ) 
respectively. The cost saved in avoiding parallel transports of tangents and Hessians 
is paid instead in the parallel transport of the basis X±. This matrix is computed in 
the first iteration at a cost of at most 0(n 3 ) operations, and in each of the consecutive 
iterations, it is parallel transported at a cost of 0(n 2 (n — r)) operations. There are 
also differences in memory requirements: In global coordinates tangents are stored 
as n x r matrices and Hessians as nr x nr matrices, whereas in local coordinates 
tangents and Hessians are stored as (n — r) x r and (n — r)r x (n — r)r matrices 



1 A given matrix representation of a point on a Grassmannian can be postmultiplied by any 
orthogonal matrix, giving a new representation of the same point. 

2 We will often omit subscripts X and A and just write T(t) when there is no risk for confusion. 
3 Here we assume that the parallel transport operator has been computed and stored. 
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respectively. Local coordinate implementation also requires the additional storage of 
X± as an n x [n — r) matrix. In most cases, the local coordinate implementation 
provides greater computational and memory savings, as we observed in our numerical 
experiments. 



By introducing the thin SVD of D = lfSV T , we can also write (5.3) as 



X x (t) = T{t)X x = [XV X X U] 



— sin £i 
cos £t 



X, I 



UW) 



This follows from the identities 



u = x\u, E = £, V = V, 

which are obtained from A = UYjV t = X±D. Using this, we will derive a general 
property of inner products for our later use. 

Theorem 5.1. Let X G Gr(n,r) and A,Ai,A 2 G T X - Define the transport 
matrix in the direction A 



T x>A {t)=T(t)=[XV U] 



where A = UEV T is the thin SVD. The 



— sin Si 
cos £i 



U T + (I -UU T ), 



(Ai ) A 2 ) = <Ai(t),A 2 (t)> for every t, 

where Ai(t) = Tx,a(£)Ai and Aa(t) = 1x^(0^2 we parallel transported tangents. 

Proof. The proof is a direct consequence of the Levi-Civita connection used in 
the definition of the parallel transport of tangents. Or we can use the canonical inner 
product on the Grassmannian (Ai,A2) = tr(A|A2). Then, inserting the parallel 
transported tangents we obtain 

(A 1 (t),A 2 (t)) = tr(A 1 (t) T A 2 (t)) = tr(A[T(t) T T(i)A 2 ) 

= tr (Aj (I-U sin(St) V T X T - XV sin(£i)[/ T ) A 2 ) 

= tr (A}A 2 ) - tr (AjU sin(Si)V" T X T A 2 ) - tr (AjXV sin(Zt)U T A 2 ) . 

The proof is concluded by observing that the second and third terms after the last 
equality are zero because X T A 2 = and AjX = 0. □ 

Remark. We would like to point out that it is the tangents that are parallel 
transported. In global coordinates tangents are represented by n x r matrices and 
their parallel transport is given by (5.2). On the other hand, in local coordinates, 
tangents are represented by (n — r) x r matrices and this representation does not 



change when the basis for the tangent space is parallel transported according to (5.3 1 



In other words, in local coordinates, parallel transported tangents are represented by 
the same matrix at every point along a geodesic. This is to be contrasted with the 
global coordinate representation of points on the manifold Gr(n, r), which are n x r 
matrices that differ from point to point on a geodesic. 

6. Quasi-Newton methods with BFGS updates on a Grassmannian. In 

this section we will present the necessary modifications in order for BFGS updates to 
be well-defined on a Grassmannian. We will write f{X) instead of f(x) since the 
argument to the function is a point on a Grassmannian and represented by a matrix 
X = \xii\ 1 ]''I^ G M" xr . Similarly the quantities Sk and yu from equations (4.3| and 



(4.4) will be written as matrices Sk and Yfe, respectively. 
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6.1. Computations in global coordinates. We describe here the expressions 
of various quantities required for defining BFGS updates in global coordinates. The 
corresponding expressions in local coordinates are in the next section. 

Gradient. The Grassmann gradient of the objective function f(X) is given by 



v/(x) = n 



A' 



df_ 

dX : 



dX 



Of 



dxi 



(6.1) 



where Ux = I — XX T is the projection on the tangent space Tx- 

Computing Sk- We will now modify the operations in equation (4.3), i.e. 



Xk+l - Xk = tffPk, 



so that it is valid on a Grassmannian. Let Xk+i be given by X k+ i = Xk(tk) where 
the geodesic path originating from Xk is defined by the tangent (or search direction) 
A 6 Tx t . The step size is given by tk- We will later assume that Sk G Tx t+1 and 
with the tangent A 6 Tx k , corresponding to p k , we conclude that 



S k = t fe A(t fe ) = t k T(t k )A, 



(6.2) 



where T(t k ) is the transport matrix defined in (5.2) 
Computing Y k . Similarly, we will translate 



Vk = 9k+i - 9k = Vf(x k+ i) - V/(x fc ) 



from equation (4.4). Computing the Grassmann gradient at Xk+\ we get V f(X k+ i) G 
Tx i+1 . Parallel transporting Vf(X k ) G Tj t along the direction A and subtracting 
the two gradients as in equation ( 4.4 ) we get 



T Xfc+1 3Y k = Vf(X k+1 ) - T(tk)Vf(X k ), 



(6.3) 



where we again use the transport matrix (5.2). Recall that Yk corresponds to yk- 

The expressions for V/, Sk and Yk are given in matrix form, i.e. they have the 
same dimensions as the variable matrix X. It is straightforward to obtain the corre- 
sponding vectorized expressions. For example, with df/dX G M" xr , the vector form 
of the Grassmann gradient is given by 



vec(V/) = (I r <8> U x ) vec 



df_ 
dX 



where vec(-) is the ordinary column-wise vectorization of a matrix. For simplicity we 
switch to this presentation when working with the Hessian. 

Updating the Hessian (approximation). Identify the tangents (matrices) A G 
jgmxr w jth vectors in M. nr and assume that the Grassmann Hessian 



pnrxnr 



3H k = H(X k ) : T Xk -> T Xk 
at the iterate Xk is given. Then 

H k = (I r ® T(t k ))H k (I r ® f (t k )) : T Xk+1 ->■ T Xk+1 , 



(6.4) 



is the transported Hessian defined at iterate Xk+i- As previously T(tk) is the trans- 
port matrix from Xk to X k+ i given in (5.2) and T(t k ) is the transport matrix from 
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Xk+i to Xk along the same geodesic path. Informally we can describe the operations 
in (6.4) as follows. Tangent vectors from Tx k+1 are transported with T(tfc) to Tx t 



on which Hk is defined. The Hessian Hk transforms the transported vectors on Tx k 
and the result is then forwarded with T(tk) to Tx k+1 - 

Since all vectors and matrices are now defined at X k +i, the BFGS update is com- 



puted using equation (4.2) in which we replace Hk with (I r (8> T(tk))H k (I r ® T(tk)) 
and use Sk = vec(Sk) and y k = vec(Yfc) from equations (6.2) and (6.3) respectively. 



6.2. Computations in local coordinates. Using local coordinates we obtain 
several simplifications. First given the current iterate X we need the orthogonal 
complement X±. When it is obvious we will omit the iteration subscript k. 

Grassmann gradient. In local coordinates the Grassmann gradient is given by 



wf=xlwf = xlu 



x 



df_ 
dX 



X 



^dx' 



(6.5) 



where we have used the global coordinate representation for the Grassmann gradient 



(6.1 1 . We denote quantities in local coordinates with a hat to distinguish them from 



those in global coordinates. 

Parallel transporting the basis X± . It is necessary to parallel transport the basis 
matrix X± from the current iterate X k to the next iterate X^+i- Only in this basis 
will the local coordinates of parallel transported tangents be constant. The parallel 



5.4 



in the trans- 



transport of the basis matrix is given by equation (5.3). 

Computing Sk and Yk- According to the discussion in Section 
ported tangent basis X±(t), the local coordinate representation of any tangent is 
constant. Specifically this is true for Sk and Y k . The two quantities are obtained with 
the same expressions as in the Euclidean space. 

Updating the Hessian (approximation) . Since explicit parallel transport is not re- 
quired in local coordinates, the Hessian remains constant as well. The local coordinate 
representations for Hk in the basis X±(t) for points on the geodesic path X(t) are 



the same. This statement is proven in Theorem 6.4 



The effect of using local coordinates on the Grassmannian is only in the geodesic 
transport of the current point Xk and its orthogonal complement Xk±. The trans- 
ported orthogonal complement X k ±(tk) is used to compute the Grassmann gradient 
V/(Xfe_|_i) in local coordinates at the new iterate Xk+\ = Xk(tk)- Assuming tangents 
are in local coordinates at Xk in the basis Xk± and tangents at Xk+i are given in the 



basis Xk±(tk), the BFGS update is given by (4.2), i.e. exactly the same update as in 



the Euclidean space. This is a major advantage compared with the global coordinate 
update of Hk- In global coordinates Hk is multiplied by matrices from the left and 



from the right (6.4). This is relatively expensive since the BFGS update itself is just a 



rank-2 update, see equation (4.2). 



6.3. BFGS update in tensor form. It is not difficult to see that if the gradient 
V/(X) is written as an n x r matrix, then the second derivative will take the form 
of a 4-tensor Hk G ^nxrxnxr^ rpj^ BFGS upc iate (|4.2|) can be written in a different 



form using the tensor structure of the Hessian. The action of this operator will map 
matrices to matrices. Assuming Sk — vec(Sfe) and Hk is a matricized form of Hk, the 
matrix-vector contraction HkSk can be written as (Hk, <Sfc)i 2- Obviously the result of 
the first operation is a vector whereas the result of the second operation is a matrix, 
and of course H k s k = vec ((Hk, <S*)i,2)- 

17 



Furthermore keeping the tangents, e.g. Sk or Yj., in matrix form, the parallel 



transport of the Hessian in equation (6.4) can be written as a multilinear product 



between H k G ^xrxnxr anc j £ w0 transport matrices T(tk) and T(tk), both in 
K nxra , along the first and third modes, 



p nxrxnxr 



3H=H k -(T(t k ),I,T{t k ),I). 



Finally, noting that the outer product between vectors corresponds to tensor 
products between matrices the BFGS update become^] 



H k+1 =H k + 



(n k ,S k ) h2 ®(H k ,S k ) h2 , Y k ®Y k 



(\Hk, S k )i,2, Sk) 



(S k , Y k ) ' 



(6.6) 



where the matrices S k ,Y k G Tx k+1 are given by (6.2) and (6.3) respectively. 

In local coordinates the update is even simpler since we do not have to parallel 
transport the Hessian operator, 



T~L k +i — T~L k 



(H k , S k )i y2 ® {T-L k ,S k )i,2 Y k ®Y k 
{{T~L k ,S k )i.2-,Sk) (S k ,Y k } 



(6.7) 



n—r) xr 



The hat indicates that the 



where H k € R^-r)xrx( n -r)x r and g k ^y k e R( 
corresponding variables are in local coordinates. 

6.4. BFGS update on a product of Grassmannians. Assume now that the 
objective function / is defined on a product of thre^] Grassmannians, i.e. 

/ : Gr(l,p) x Gr(m, q) x Gr(n,r) -» R, 

and is twice continuously differentiable. We write f(X,Y,Z) where X G Gr(/,p), 
Y G Gr(m, q) and Z G Gr(n, r). The Hessian of the objective function will have 
a 'block tensor' structure but the blocks will not have conforming dimensions. The 
action of a (approximate) Hessian operator on tangents Ax G Tj, Ay G Ty and 
Az G Tz may be written symbolically as 



T~ixx 
Uyx 
Hzx 

(Hxx, 
{Uyx., 

(UZX: 

(Uxx, 
{Hxy, 
{T~Lxz-. 





Ax" 




Ay 




A z 



(Hxy, A 



Y/3A 



1,2 



^xy T~Lxz 
Hyy Hyz 
H-zy Hzz 

Ax)3,4;l,2 

Ax)3,4;l : 2 + (Hyy,Ay) 3j 4 
Ax)3,4;l : 2 + {7-LzYi Ay) 3j4 

Ax)i, 2 + (Hyx, A Y )i,2 + (Hzx,A z )i,2 
Ax)i, 2 + [Hyy,A y )i,2 + {Hzy, Az)i,2 
Ax)i,2 + {T~Lyz, Ay) 1:2 + CHzz, A Z )l, 2 



(6.8) 



(H 

i.2 + (n 

1.2 + (H 



xz 

YZ 



zz 



, Az)3,4;l,2 
, A^)3 i 4 ; i ! 2 
Az)3.4:l,2 



The blocks of the Hessian are 4-tensors and elements of the tangent spaces are matri- 
ces. The result of the operation 3 is a triplet where each element is in the corresponding 
tangent space. For example T-Lxx is an I x p x I x p tensor which acts on the tangent 
matrix Ax of size I x p with the result (Hxx- Ax) 1,2 G Tj. Off diagonal example 



4 The contractions denoted by (■, ■}* are denned in AppendixpU 

5 We assume k = 3 for notational simplicity; generalization ortnese discussions to arbitrary k is 
straightforward . 
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may look as follows, T-Lyz is an m x q x n x r tensor which acts on the tangent ma- 
trix A z of size n x r with the result (Hyz, Az)3,4 ; i,2 = (Hzy,^z)i,2 G Ty. The 
equality in the last step follows from the fact that the n x r x m x q tensor Hzy is a 
permutation of the m x q x n x r tensor T-Lyz- This is expected since for twice contin- 
uously differentiable functions f xy = f yx . But in our case they have different 'shapes'. 
The three tangent spaces Tx, Ty and Tz are interconnected through the Hessian of 



f(X, Y, Z) in the sense that every block in (6.8 ) is a linear operator mapping matrices 
from one tangent space to another tangent space. For example T-Lyx '■ Tx — > Ty and 
Tizx ■ Tx — > Tz- 

The corresponding BFGS in the product manifold case has basically the same 
form as equations (6.6 1 and (|6.7|) where the action of the Hessian on Sk, which will 



be a triplet with an element on each tangent space, is replaced with formulas as in 

product needs to be moc 
(r x ,iY,r z ) then we let 



(6.8 1. Also the tensor/outer product needs to be modified in the obvious way, i.e. if 
A = (A X ,A Y ,A Z ) and T 



A ® T = (A Xl A Yl A z ) ® (Tx, Ty, T z ) (6.9) 

~A X ®T X A x ® Ty A x ® T z 
A Y ®T X AyiglTy A Y ®T Z 
A Z ®T X A Z ®T Y A Z ®T Z 

where the results are conveniently stored in a 'block matrix' whose blocks are tensors 
of different dimensions (possibly nonconforming) . 

6.5. Optimality of BFGS on Grassmannians. The BFGS update in quasi- 
Newton methods is optimal because it is the solution to 



mm 



\H-H k \\ F subject to H = H 



Hsu 



Vk, 



where Sk and are given by (4.3) and (4.4) respectively [46]. For the Euclidean 



case it is immaterial whether H is considered as an abstract operator or explicitly 
represented as a matrix. The final conclusion with respect to optimality is the same — 
it amounts to a rank-2 change of Hk ■ The situation is different when considering the 
corresponding optimality problem on Grassmannians. In particular, a given Hessian 
(or approximate Hessian) matrix Hk considered in a global coordinate representation 
and defined at Xj, £ Gr(n, r) has the following form when parallel transported along 
a geodesic, 

H k = (J P ® T{tk))H k {I r <g> f{tk)). 



This is the same expression as equation ( |6.4[ ). While the Hessian operator should 
not change by a parallel transport to a new point on the manifold, its representation 
evidently changes. This has important numerical and computational ramifications. 
In fact, the global coordinate representation of the Hessian at the previous point is 
usually very different from the global coordinate representation of the transported 
Hessian at the current point. 

Assume now the Hessian matrix (or its approximation) is given in local coordi- 
nates Hk at Xk € Gr(n, r) and let Xk j_ be the associated basis matrix for the tangent 
space. Representation of the parallel transported Hessian will not change if the asso- 
ciated basis matrix Xk±(t) is transported according to (5.3). The updated Hessian 



at the current point is a rank-2 modification of the Hessian from the previous point 
given by the BFGS update. The optimality of BFGS update on Euclidean spaces is 
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with respect to a change in successive Hessian matrices; we will prove that in the 
correct tangent space basis and in local coordinates, the BFGS update is also optimal 
on Grassmannians. 

We now give a self contained proof for this statement. First we will state the 
optimality results for the Euclidean case. The proofs of Theorem|6.1[ Lemma [6~2[ and 



Theorem 6.3 are based on [211 [22] ■ We will then use these to deduce the corresponding 



optimality result on a product of Grassmannians in Theorem |6.6 



Theorem 6.1. Let B e 



y G K™, 0^s€ 



The solution to 



mm{\\A-B\\ F | A G 



\ As = y} 



is given by 



B = B 



(y - Bs)s 1 



Proof. Note that while the set Q(y,s) := {A G R" x ™ | As — y} is non-compact 
(closed but unbounded), for a fixed B, the function / : Q(y, s) — > K, f(A) = \\A— B\\p 
is coercive and therefore a minimizer A* G Q(y, s) is attained. This demonstrates 
existence. The minimizer is also unique since Q(y, s) is convex while / is strictly 
convex. We claim that A* — B: Observe that Bs = y and so B G Q(y,s); for any 
A G Q{y, s), 



\B-B\\ F = 



ys 



Bt 



< WA-BW 



= \\A-B\ 



Lemma 6.2. Let y G R n , O^seR". Then the set Q(y, s) = {A G R nXn | As = 
y} contains a symmetric positive definite matrix iff y = Lv and v = L T s for some 
O^el" and L G GL(n). 

Proof. If such v and L exist, then y = Lv = LL T s and so LL T is a symmetric 
positive definite matrix in Q(y 1 s). On the other hand, if A G Q(y,s) is symmetric 
positive definite, its Cholesky factorization A = LL T yields an L G GL(n). If we let 
v = L T s, then Lv = As — y, as required. □ 

Theorem 6.3. Let y G R n , O^sel". Let L G GL(n) and /f = Li 1 ". T/iere is 
a symmetric positive definite matrix H + G Q(y,s) iff ' y 1 s > 0. in t/iis case, i/ie BFGS 
update H + = L + L^_ is one where 



(y- aHs)(L T s) T , / y T s 

L+=L+— ^ — with a = ±\/^— . (6.10) 



as T Hs 



s T Hs ' 



Proof. In order for the update (6.101 to exist it is necessary that there exists 



0^t)£ R™ and L + G GL(n) such that y = L + v and u = L\s. Hence 

< v T v = (Lls^iL^y) = s T y 

as required. 

If v is known, then the nearest matrix to L that takes v to y would be the update 
given in Theorem |6.1[ i.e. 



L + = L 



(y — Lv)v T 
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Hence we need to find the vector v. By Lemma |6. 2 



and so 



r T t ~~v y T s — v T L T s , . 

v = L\s = L T s+- f v (6.11) 



v = aL T s (6.12) 



for some ael. Now it remains to find the scalar a. Plugging (6.121 into ( 6.11[ ) and 
using H = LL T , we get 



y T s — as 1 Hs 9 y T s 

a =>■ a — 



a 2 s T Hs s T Hs 

If y 1 s > 0, this defines an update in Q(y, s) that is symmetric positive definite. It is 
straightforward to verify that H + = L + L\ yields the BFGS update 

Hss T H yy T 

n+ = H =— h 

s 1 Hs y' s 

□ 

Theorem 6.4. Let X £ Gi(n,r) and Xj_ be the orthogonal complement to 
X, i.e. [X X±] is orthogonal. Let A G Tx and X/±(t) be a geodesic with Tx,A(t) 



the corresponding transport matrix, defined according to equations (5.1) and (5.2). 
Identify Tx with K(" _r ) r and consider a linear operator in local coordinates A : 
jj(n-r)r _^ jj(«-f)r Consider the corresponding linear operator in global coordinates 
A : M. nr — > M. nr , in which tangents in Tx °- r £ embedded. The relation between the two 
operators is given by 

A=(I®X±_)A(I®Xl), (6.13) 
A=(I®Xl)A(I®X±_). (6.14) 

Furthermore, the parallel transported operator A has the same representation for all 
t along the geodesic X(t), i.e. A{t) = A. 

Proof. Let d\ £ IR(™ _r ) r be a tangent vector with corresponding global coordinate 
matrix representation Ai = X±Di £ Tx- Obviously d\ — vec(Di). We may write 
vec(Ai) = (7(8) X±)di. Set d 2 = Ad\ and it follows that vec(A 2 ) = (I® X±)d 2 . The 
corresponding operation in global coordinates are 

vec(A 2 ) = Avec(Ai) & (I <8 X±)d 2 = A(I <8 X±)di 
& d 2 = (I (8) Xl)A(I (8) X±_)di 



and it follows that A = (I (8 Xj_)A(I 8) X±), which proves ( |6.14 ). 

For any tangent A* £ Tx it holds that A* = 11^ A* — X±Xj_A*, where Hx is 
a projection onto Tx, and consequently vec(A*) = (/ (8> X±Xj_) vec(A*). Thus the 
operations in global coordinates also satisfy 

vec(A 2 ) = [I (8 X±Xl)A(I 8> X±Xl) vec(Ai) 

= (1(8 8) Xl) vcc(Ai) 

= Avec(Ai). 
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This proves equation (6.131. 



For the third part we have A 



T x and A(t) 



-x(t) 



-x(t) 



with 



A(0) = A. We want to prove that A{t) = A for all t. The operator A(t) is defined in 
the following sense: a tangent Ai(i) e T^x(t) is parallel transported with T X (t),-A(t){t) 
to X(Q) along X(t), the operator transformations is performed in Tx, thus A 2 = 
A(Ai(0)) € T x and the result is forwarded to T x(t) , i.e. A 2 (t) = T XA {t)A 2 . The 
parallel transported operator in global coordinates takes the form 

A{t) = (1(8) T XA (t))A(I (8) r x(t)j _ A(t) (t)). (6.15) 

Then, in the basis -Xj_(i), the local coordinate representation of the operator is 

A{t) = (I® Xl(t))A(t)(I (g) X ± {t)). (6.16) 

Substituting ( 6.15| ) into ( |6.16| ), we obtain 

A{t) = (I ® Xl(t)T x>A (t)X ± )A(I ® Xlf x(t)t _ A(t) (t)X ± (t)). 

Recall that T XA (t)X ± = X ± (t) and thus Xl{t)T XA (t)X ± = I. Similarly one can 
show that X]T X (t)_ A ( t )(t)Xx(t) = I and we get A{t) = A for all t. □ 

A different proof of essentially the same statement may be found in [52] . 

Lemma 6.5. Let e Gr(n i ,r i ), i = l,...,k, with corresponding tangent 
spaces T Xi ■ Let Yi be given such that [Xi Yi\ is orthogonal. On each Grassmannian 
Gr(rii,ri), let Xi(t) be a geodesic and Yi(t) be its orthogonal complement correspond- 
ing to the tangent Aj € T Xi . Then a local coordinate representation of the linear 
operator 

A : T Xl x • • • x T Xk -> T Xl x • • • x T Xk 

is independent of t when parallel transported along the geodesies Xi{t) and in the 
tangent basis Yi (t) . 

Proof. First we observe that the operator A must necessarily have the structure 



.4 



A u 



A k i 



A 



A' A.'. 



where each Ay, 1 < i,j < k is such that Ay : Tj. — > T^,. Now, applying a 



similar procedure as in Theorem 6.4 on each block Ay proves that local coordinate 



representation of A^ and thus A is independent of t along the geodesies Xi(t) in the 
tangent space basis Yi(t). □ 

Now we will give an explicit expression for the general BFGS update in tensor form 
and in local coordinates. We omit the hat and iteration index below for clarity. For a 
function defined on a product of k Grassmannians / : Gr(n l5 r x ) x • • • xGr(nt, r^) — ► K, 
we write f(X u . . . , X k ) and S = {St, . . . , S k ), Y = (Yi,..., Y k ) where X, G Gr(n i; n) 
and Si, Yi G T Xi for i — 1, . . . , k. The Hessian or its approximation has the symbolic 
form 



H = 



Hit • • • 'Hik 

Jikl ' ' ' T~Lkk 
22 



where each block is a 4-tensor. The BFGS update takes the form, 

v _ (H,s) l!2 ®(n,s) l!2 y®y 

n +- n+ ((n,s) h2 ,s) + <w (6 - 17) 



where the (W,S)i,2 is given by a formula similar to (6.8) with the result being a k- 
tuple, the tensor product between fc-tuples of tangents is an obvious generalization of 
(6791), and of course (S,Y) = £? =1 <Si,ii}. 

Finally, we have all the ingredients required to prove the optimality of the BFGS 
update on a product of Grassmannians. 

Theorem 6.6 (Optimality of BFGS update on product of Grassmannians). Con- 
sider a function f(X\, . . . , X k ) in the variables Xi € Gr(nj, Ti), i = 1, . . . , fc, that we 
want to minimize. Let Xi(t) be geodesic defined by Aj € Tx s with the corresponding 
tangent space basis matrices Yi(t). In these basis for the tangent spaces, the BFGS 
updates in (6.17) on the product Grassmannians have the same optimality properties 
as a function with variables in a Euclidean space, i.e. it is the least change update of 
the current Hessian approximation that satisfies the secant equations. 

Proof. First we observe that the Grassmann Hessian of f{X\,...,Xk) (or its 
approximation) is a linear operator 

H f : T Xl x ■ • • x T Xk -> T Xl x ■ ■ • x T Xk 



and according to Lemma 6.5 its local coordinate representation is constant along the 
geodesies Xi(t). Given this, the BFGS optimality result on product Grassmannians is 
a consequence from Theorem |6.3| — the optimality of BFGS in Euclidean space. □ 

Remark. An important difference on (product) Grassmannians is that we need 
to keep track of the basis for the tangent spaces — Yi(t) from equation (5.3). Only 
then will the local coordinate representation of an operator be independent of t when 
transported along geodesies. 

Note that Theorem |6.6| is a coordinate dependent result. If we regard Hessians 
as abstract operators, there will no longer be any difference between the global and 
the local scenario. But the corresponding optimality as the least amount of change 
in successive Hessians cannot be obtained in global coordinate representation and is 
thus not true if the Hessians are regarded as abstract operators. 

6.6. Other alternatives. Movement along geodesies and parallel transport of 
tangents are the most straightforward and natural generalizations to the key opera- 
tions from Euclidean spaces to manifolds. There are also methods for dealing with 
the manifold structure in optimization algorithms based on different principles. For 
example, instead of moving along geodesies from one point to another on the manifold 
one could use the notion of retractions, which is a smooth mapping from the tangent 
bundle of the manifold onto the manifold. Another example is the notion of vector 
transport that generalizes the parallel translation/transport of tangents used in this 
paper. All these notions are defined and described in [2]. It is not clear how the use 
of the more general vector transport would effect the convergence properties of the 
resulting BFGS methods. 

7. Limited memory BFGS. We give a brief summary of the limited memory 
quasi-Newton method with L-BFGS updates on Euclidean spaces [5] that we need later 
for our Grassmann variant. See also the discussion in [52l Chapter 7]. In Euclidean 
space the BFGS update can be represented in the following compact form 



H k = H +[S k H Y k ] 



Rl\D k + Y^H^R- 1 -R- 

-Rk 1 o 
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si 

Y k T H 



(7.1) 



where S k = [s , ■ ■ ■ ,s k -i], Y k = [y , . . .,y k -i], D k = diag [sjj/o, ■ ■ -^I-iJ/fc-i] and 



R k . — 



slvi 

s[yi 



slyk-i 
sly k -i 



Vk-i 



are obtained using equations (4.3) and (4.4 1. Observe that in this section S k and 



are not the same as in (6.2 1 and (6.3 1 respectively. The limited memory version of 
the algorithm is obtained when replacing the initial Hessian Ho by a sparse matrix, 
usually this is a suitably scaled identity matrix -f k I, and only keep the m most resent 



Sj and yj in the update (7.1 ). Since m <n the amount of storage and computations 



in each iteration is only a small fraction compared to the regular BFGS. According to 
[41)] satisfactory results are often achieved with 5 < m < 20, even for large problems. 
Our experiments confirm this heuristic. Thus for the limited memory BFGS we have 



H k = lk I+[S k lk Y k ] 



R k T (D k 



-l 
fe 



-Rl 





' s T k " 




> Y k. 



where now 



Sfe- 



•mj • • ■ j 



Sfc_i] , Yfc = [yk- 



,2/fe-i 



] , D k = diag [sl_ 



iVk- 



and 



(7.2) 



Rk — 







T.Vk-1 



'fe-m+l 



Uk-m+l 



Vk-l 







'fe-m+l 



7.1. Limited memory BFGS on Grassmannians. Analyzing the l-bfgs 
update above with the intent of modifying it to be applicable on Grassmannians, we 
observe the following: 



1. 



The columns in the matrices Sk and Y k represent tangents, and as such, they 
arc defined on a specific point of the manifold. In each iteration we need to 
parallel transport these vectors to the next tangent space. Assuming s k and 



y k are vectorized forms of (6.2) and (6.3) the transport amounts to computing 
S k = (I r ® T(t k ))S k and Y k = (I r <g) T(t k ))Y k where T(t k ) is the Grassmann 
transport matrix. 

The matrices R k and D k contain inner products between tangents. Fortu- 
nately, the inner products are invariant with respect to parallel transporting. 
Given vectors u,v € Tx k and a transport matrix T from Tx k to Tx fc+1 , 
i.e. Tu,Tv £ Tx k+1 , we have that (Tu,Tv) = (u,v). This is a direct result 
from Theorem |5.1[ showing that there is no need for modifying R k or D k . 
Because of this property one may wonder whether the transport matrix T is 
orthogonal, but this is not the case, T T T ^ I. 



3. Recalling the relation from equation (6.13) between local and global coordi 



nate representation of an operator, we conclude that the global representation 
is necessarily a singular matrix, simply because the local coordinate represen- 
tation of the operator is a smaller matrix. The same is true for the Hessian 
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using global coordinates. But by construction, the l-bfgs update H k in 



(7.2 1 is positive definite and thus nonsingular. This causes no problem since 



Tx k is an invariant subspace of H k , i.e. if v G Tj t 



Lemma 7.1 Similarly for the solution of the (quasi-)Newton equations (4.1 ) 



since y k G T Xk and H k : T Xk -> T Xk , then obviously p k e . This is 



valid for H k from both (7.1 1 and (7.2 1 



Lemma 7.1. The tangent space Tx k is an invariant subspace of the operator 
obtained by the L-BFGS update. 

Proof. This is straightforward. Simply observe that for a vector v k G Tx k we 
have that H k v k is a linear combination of vectors, and all of them belong to Tx k - D 

L-BFGS algorithms are intended for large scale problems where the storage of the 
full Hessian may not be possible. With this in mind we realize that the computation 
and storage of the orthogonal complement X±, which is used in local coordinate 
implementations, may not be practical. For large and sparse problems it is more 
economical to do the parallel transports explicitly than to update a basis for the 
tangent space. The computational time is reasonable since only 2(m — 1) vectors are 
parallel transported each step and m is usually very small compared to the dimensions 
of the Hessian. 

8. Quasi-Newton methods for the best multilinear rank approximation 
of a tensor. In this section we apply the algorithms developed in the last three sec- 
tions to the tensor approximation problem described earlier. Recall from Section |3.1| 
that the best multilinear rank-(p, q, r) approximation of a general tensor is equivalent 
to the maximization of 

$(X,Y,Z) = -\\A-(X,Y,Z)\\% s.t. X T X = I, Y T Y = I, Z T Z = L 

where A G R** m * n and X G M' xp , Y G R mx i, Z G R nxr . Recall also that X,Y,Z 
may be regarded as elements of Gv(l,p), Gr(m, q), and Gr(n, r) respectively and $ may 
be regarded as a function defined on a product of the three Grassmannians. The Grass- 
mann gradient of $ will consist of three parts. Setting T = A-(X, Y, Z), one can show 
that in global coordinates the gradient is the triplet V$ = (Hx& x ,HY®y,Hz&z), 
where 





= {A 




g m. lxp , 




= i- 


xx T , 


(8.1) 




= {A 


(X,I1 Y ,Z),F)_ 2 


e R mxq , 


n y 


= i - 


YY T , 


(8.2) 


n z $, 


= (A 


(x,y,n z ),jr)_ 3 


G M" xr , 


riz 


= i - 


ZZ T , 


(8.3) 



and $2 = d^/dX, $ y = d^/dY and $ 2 = d<&/dZ, see equation (6.1 1. For derivation 
of these formulas 3 see [5S] . 

To obtain the corresponding expressions in local coordinates we observe that a 
projection matrix can also be written as Tlx = X±Xj_. Then for tangent vectors 

G Tx, we have 

A, = fix A, - X ± XlA x = X ± D X , 

which gives the local coordinates of A x as X]_A X = D. x . The practical implication 
of these manipulations is that in local coordinates we simply replace the projection 
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matrices n x ,n y ,n z with X\,Yl,Z\. We get V$ = (X~[$ x , Yl<f> y , Z~]_<5> z ), where 

Xl$ x = {A-(X±,Y t Z),?)- 1 eR"-"^, (8.4) 
Yl^> y = (A- (X,Y ± ,Z),F>- 2 eR (m -« )x «, (8.5) 
Zl$ m = {A-{X,Y,Z ± ),J : )- a eR {n ~ r)xr . (8.6) 

Note that the expressions of the gradient in global and local coordinates are different. 
In order to distinguish between them we put a hat on the gradient, i.e. V<&, when it 
is expressed in local coordinates. 

8.1. General expression for Grassmann gradients and Hessians. In the 

general case we will have an order-/c tensor A £ R™ 1 *'" Xnk and the objective function 
takes the form 

$(X u ...,X k ) = ^\\A-(X 1 ,..,,X k )\\%. 

The low rank approximation problem becomes 

rnax$(X u ...,X k ) s.t. XjX l = I, i = l,...,k. 

The same procedure used to derive the gradients for the order-3 case can be used 
for the general case. The results are obvious modifications of what we have for 3- 
tensors. First we introduce matrices Xix, i = 1, . ■ ■ ,k, such that each [Xi forms 
an orthogonal matrix and we define the tensors 

T = A ■ (Xi, X 2: X- 3 , . . . , X k ), 
Bi = A ■ (Xi±, X 2l X 3 , . . . , X k ), 
B2 = A - (Xi,X 2 ±,X a , . . . , Xk), 

(8-7) 

Bk-l — A ■ (Xi, . . . , X k _2: -^fc-l.Xl -^fc)i 

B k = A ■ {X\, . . . , X k -2,X k -i,X k ±). 

The Grassmann gradient of the objective function in local coordinates is given by the 
fc-tuple 

V$ = ($i,$ 2 ,...,$ fc ), $i = (Bi, F)_i, i=l,2,...,fc. 

Each $i is an (rii — r{) x matrix representing a tangent in Txi- To obtain the 
corresponding global coordinate representation, simply replace each Xi± with the 
projection H Xi = I — X { Xj . 

We will also give the expression of the Hessian since we may wish to initialize 
our approximate Hessian with the exact Hessian. Furthermore, in our numerical 



experiments in Section 11 the expression for the Hessian will be useful for checking 
whether our algorithms have indeed arrived at a local maximum. In order to express 
the Hessian, we will need to introduce the additional variables 

Cl2 

Cl3 C23 

C2,fe • • • C k -i,k 

26 



where each term is a multilinear tensor-matrix product involving the tensor A and a 
subset of the matrices in {A"i, . . . , X^, In, . . . , X k ±}. The subscripts i and j in C^- 
indicate that Xi± and Xj± are multiplied in the ith and jth mode of A, respectively. 
All other modes are multiplied with the corresponding X c i, d ^ i and d 7^ j. For 
example we have 

C12 = A ■ (Xi±. X 2 ±, X 3 , . . . , Xk), 
C24 = A ■ (Xi , X 2 ± , X3 , X4± , X5 , . . . , Xk ) , 
Cfc-i,fc = A ■ (Xi, .... A fe _ 2 , A fe _ lj j_, X k j_). 

Together with B\, . . . , introduced earlier, one can express the complete Grassmann 
Hessian of the objective function $>(Xi, . . . ,Xk). The derivation of the Hessian is 
somewhat tricky. The interested reader should refer to [35] for details. In this paper 
we only state the final result in a form that can be directly implemented in a solver. 
The diagonal blocks of the Hessian are Sylvester operators and have the form 

UaiPi) = (Bi,Bi)-iDi- Di{T,r)-i, i = l,2,...,k. 

The off-diagonal block operators are 

^12(^2) = ((Cl2,-7 r )-(l,2),-C ) 2)2,4;l,2 + ( (#1 , B 2 ) - (1,2) > D 2 ) 4,2;1,2 7 
Hij(Dj) = ((Cij,J 7 )-(i 1 j),Dj)2,4;l,2 + ((Bi, Bj) -(t,j), 4,2;1,2, 

where i ^ j , i < j and j, j = 1, 2, . . . , k. See Appendix [A] for definition of the con- 
tracted products ( •, .j). 

9. Best multilinear rank approximation of a symmetric tensor. Recall 
from Section [2] that an order-fc tensor S G l nX '" x '' is called symmetric if 

s MiyMfc) = s i 1 ...i k , ii, . . . , ik S {1, • • • , n}, 

where a € the set of all permutations with fc integers. For example, a third order 
cubical tensor S £ f' iX ' lxn is symmetric iff 

Sijk &ikj Sjik &jki S k ij S k ji 

for all k € {1, ...,n}. The definition given above is equivalent to the usual 
definition given in, say [30]; see [5] for a proof of this simple equivalence. Recall 
also that the set of all order-fc dimension-n symmetric tensors is denoted S fe (IR™). 
This is a subspace of M nx '" xn and 

dimS fe (R«) = ( n + ^ 1 ). 

Lemma 9.1. IfSe S fe (C") and rank(S) = (n, . . . , r k ), then 

n = ■■■ = Th- 
in other words, the multilinear rank of a symmetric tensor is always of the form 
(r, . . . ,r) for some r. We will write rs(S) for this common value. Furthermore, we 
have a multilinear decomposition of the following form 

S = (X,X,...,X)-C, (9.1) 
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where C € S k (W) and X e 0(n,r). 

Proof. The ranks r t being equal follows from observing that the matricizations 
SW,...,SW of S are, due to symmetry, all equal. The factorization (|9.1|) is a conse- 



quence of the higher order singular value decomposition (hosvd) [T5]. □ 

In application where noise is an inevitable factor, we would like to study instead 
the approximation problem 

S^(X,X,...,X)-C, 
instead of the exact decomposition in ( |9.1[ ). More precisely, we want to solve 

min{||5 — T\\f I T eS k (R n ), r s (T) < r}. (9.2) 
Similar analysis as in the general case shows that the minimization problem 



(9.2) can be reformulated as a maximization of ||<S ■ (X, . . . , X)\\p, with the con- 
straint X T X = I. The objective function becomes $(A) = | (J 7 , J 7 ) where now 
J- = S ■ (X, . . . , X). Observe that the symmetric tensor approximation problem is 
defined on one Grassmannian only, regardless of the order of the tensor. These prob- 
lems require much less storage and computations compared to a general problem of 
the same dimensions. Applications involving symmetric tensors are found in signal 
processing, independent component analysis, and the analysis of multivariate cumu- 
lants in statistics [IHl HI E3 HH HH HH El Hll [H]. We refer interested readers to [3] 
for discussion of a different notion of rank for symmetric tensors. 

9.1. The symmetric Grassmann gradient. The same procedure for deriving 
the gradient for the general case can be used to obtain the gradient for the symmetric 
case. In particular it involves the very same terms as the nonsymmetric gradient with 
obvious modifications. It is straightforward to show that, due to symmetry of S ', 

(S- (IL X ,X,X), J-)_ a = (S-(X,I1 X ,X),F}_ 2 = (S ■ (X,X,Il x ),F)_ 3 . 

We will use the first expression without loss of generality. In which case, the Grass- 
mann gradient in global coordinates becomes 

V$ - Il x <Z> x = 3(5 • {U X ,X, X), .F)_ X) (9.3) 

where H X = I — XX T ; and in local coordinate it is 

V$ = X X $ X =3(S-(X X ,X,X),F)- 1 , (9.4) 



where Xj_ is the orthogonal complement of X . Compare these with equations (8.1 1 



(8.3) for the general case. 



9.2. The symmetric Grassmann Hessian. As for the general case discussed 
in [25j . we may identify the second order terms in the Taylor expansion of $(Aa(£)). 
There are 15 second order terms and all have the form 

(A,H,{A)), A € Tx and X e Gr(n,r), 

for some linear operator r H. lf . Two specific examples are 

(AHn(A)) = (A, (B 1 ,B 1 )_ 1 A-A(.F J .F)-i>, 

(A,Wi 2 (A)> = (A,«C 12 ,^>_ (li2) ,A) 2 ,4 ; i,2 + «Bi,B 2 }-(i,2),A)4, 2; i, 2 >, 
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where B x = S ■ (U X ,X,X), B 2 = S ■ (X,Jl x ,X) and C 12 = S ■ (U x ,Il x> X). The 
subscripts 1 and 2 indicate that the projection matrix IIx is multiplied with S in the 
first and second mode respectively. Not surprisingly, analysis of these terms reveals 
equality among the second order terms due to the symmetry of S. Gathering like 
terms and summing up the expressions, we see that the Hessian is a sum of three 
different terms, 

(A,Hi(A)) =(A,3(B 1 ,B 1 )_ 1 A-3n x A(7-,7-)_ 1 ) ) (9.5) 
(A,H 2 (A)) =(A,6((C 12 ,J-)_ (1)2) ,A) 2 , 4; i,2), (9.6) 
(A,H 3 (A)) =(A,6((B 1 ,B 2 )_ (1>2) ,A) 4 , 2;1 , 2 ). (9.7) 

So the action of the Hessian on a tangent is simply 

H(A) = Hi (A) + %(A) + H 3 (A). 



Observe that the second term in (9.5 1 arises from the fact that the objective function 
is defined on a Grassmannian, see [21] for details. 

9.3. General expression for Grassmann gradients and Hessians for a 
symmetric tensor. With the analysis and expressions for symmetric 3-tensors at 
hand, generalization to symmetric fc-tensors is straightforward. We will only state the 
final results and in local coordinates. Assume we have an order-fc symmetric tensor 
S G S fe (R"). The corresponding symmetric low rank tensor approximation problem 
is written as 

max$(X) =maxi||«S- (X,...,X)\\% s.t. IeGr(n,r). 

Using the tensor products 

J- = S ■ (X, X, . . . , X) X appears k times, 
B\ — S ■ (X±,X, . . . , X) X appears k — 1 times, 

where X± is such that [X X±_] forms an orthogonal matrix, the Grassmann gradient 
becomes 

Observe that the symmetric case involves the very same tensor products Bi as in the 
general case (given in Section 8.1 ) but due to the symmetry of the problem all terms 
are equal. 

We also introduce tensor-matrix multilinear products Cij similar to those in equa- 



tion 



.8). Two specific examples are 



C12 — s ■ (X±,X±,X, . . . , X), 

C 2 4 — s ■ (X, X± , X, X± , X, . . . , X) . 

In general Cy, where i ^ j, i < j and i,j = 1, . . . , k, is a multilinear product of two 
Xj_'s that are multiplied in the ith and jth mode of S. All other modes are multiplied 
with X. 

The second order terms of the Taylor expansion of &(X) contain the following 
diagonal block operators 



H U {D) = (Bi,Bi)-iD- A{F,F)_i, i = l,2,...,k. 
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Again, due to symmetry all these are identical and summing them up we get 

H dia , s {D) = k{{B 1 ,B l )- 1 D-D{F,P)- 1 ). 
The off-diagonal block operators have the form 

H X2 {D) = «Ci2, J0- ( i,2), D) 2A . lr2 + «Bi,B 2 )_ (li2) , £>) 4l 2;i,2, 

Uij(D) = (\Cij, J 7 )-(iJ),-D)2,4;l,2 + ((Bi,Bj}-{i,j),D)^ 2; l,2, 

where i ^ j , i < j and i, j = l,...,k. Similarly, due to symmetry all of them are 
identical. We have 

ffoff-diag(^) = fc(fe-l)(((C 12) 7-)_ (1>2)l i?) 2 ,4 ; l ) 2 + <<S 1 ,B 2 )_ (1)2) , J D)4,2 ; l > 2)- 

The complete Grassmann Hessian operator is simply 

H = i?diag + -ffoff-diag- 

9.4. Matricizing the Hessian operator. The second order terms are de- 
scribed using the canonical inner product on Grassmannians and contracted tensor 
products. Next we will derive the expression of the Hessian as a matrix acting on the 
vector d = vec(A). 



The terms in (9.5) involve only matrix operations and vectorizing the second 



argument in the inner product yields 

vec ((Wi(A)) = vec(3<Bi,Bi)_iA - 3U X A(T,T)^) 

= 3 (7 (g) {Bi,Bi)-i - {J 7 , O Tlx) vec(A) 
= Hid. 



nnxnxrxr 



The vectorization of the terms from (9.6) and (9.7|) involve the 4-tensors 

Hi = (Cl2; J")_(l,2) € 

"(1,2) 

and is done using the tensor matricization described in [25] . We get 



H 3 = (B u B 2 )_ ih2) eR nxrxrxn , 



vec((H 2 , A) 2 ,4;i,2) = -fff ,1;4 ' 2) vec(A) = H 2 d, (9.8) 
vec((H 3 , A) 4)2;1 ,2) = i/f 1;2 ' 4) vec(A) = H 3 d. (9.9) 

In H 2 we map indices of the first and third mode to row-indices and indices of the 
second and fourth mode to column-indices obtaining the matrix H^' 1 ' 4 ' 2 ^. In this 
way the contractions in the matrix-vector product coincide with the tensor-matrix 
contractions. Similarly for H 3 . The matrix form of the Hessian becomes 

H = Hi + H 2 + 7?3 . 

To obtain the Hessian in local coordinates we replace IIjc with X± in the computations 
of the factors involved and thereafter perform the same matricization procedure. 

30 



10. Examples. We will now give two small explicit examples to illustrate the 
computations involved in the algorithms for tensor approximation described before. 

Example 10.1. In this example we will compute the gradient of the objective 
function, both in global and in local coordinates. Let the 3x3x3 tensor A be given 
by 



A(:,:,l) 



9 -3 
2 7 
7 



A(:,:,2) 



2 7 " 




"3 -2" 


-7 5 -3 


,A(:,:,3) = 


4-1 


-3 1 




0-2 1 



Let the current point of the product manifold be given by (X, Y, Z) where 



X =Y = Z = 



II; 



X = Uy = ll Z = 



ooo 
o 1 o 
o o 1 



are the corresponding projection matrices onto the three tangent spaces. The expres- 
sion for the Grassmann gradient at the current iterate is given by (8.1 1 — ( 8.3 ) . The 



intermediate quantities, cf. equation (8.7), needed in the calculations of the Grass- 
mann gradient are 



T = A - (X, Y, Z) = 
B x = A-(n x ,Y,Z). 
B y =A-(X,U Y ,Z) 



9, 



(0 2 7) T , 
= (0-3 8) T , 
B z =A-(X,Y,U z ) = (0 2 3) T , 

and the Grassmann gradient in global coordinates is given by 



(10.1) 
















18 




-27 




18 


63 




72 
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To compute the Grassmann gradient in local coordinates we need a basis for the tangent 
spaces. For the current iterate we choose 



X\ = Y\=Z\ = 





1 
1 



as the corresponding basis matrices for the tangent spaces at X , Y and Z . Obviously 
[X X x ]> [Y Y x ] and [Z Z x ] are orthogonal and X T X ± = Y T Y ± = Z T Z ± = 0. Re- 
placing the projection matrices Tlx, Ily and Hz by the orthogonal complements X±, 
Y± and Zj_ in (10.1), we obtain B x ,B y ,B z , and thus the local coordinate representa- 



tion of the Grassmann gradient is given by 

V% = {(B xl F)_ ll (B v ,F)_ 2l (B z ,F\ 



18 




-27 




18 


63 




72 
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Recall that we use a hat to distinguish local coordinate representation from global 
coordinate representation. The local coordinate representation is depending on the 
choice of basis matrices for the tangent spaces. A different choice of X± , Yj_ and Z± 
would yield a different representation o/V$. 

Example 10.2. Next we will illustrate the parallel transport of tangent vectors 
along geodesies on a product of Grassmannians. Let the tensor A, the current iterate, 



31 



and the corresponding gradient be the same as in the previous example. Introduce 
tangent vectors 



(A x , A v , A z ) = 
















-1 







7 


1 







1 








Clearly we have X T A x = Y T A y = Z T A z = 0. The tangent A will determine the 
geodesic path from the current point and in turn the transport of the Grassmann 
gradient (see Figure 10.1). We may also verify that V$ is indeed a tangent of the 
product Grassmannian at the current iterate. 

The thin or compact SVDs, written A* = [/* • S* ■ Vj, of the tangents are 



A x = 





-1 




1-1, 



Ay = 



1-1, A z = 



1 • 1. 



The transport matrix, cf. equation (5.2), in the direction A x at X with a step size 
t = 7r/4 is given by 



-sin£ x (7r/4) 
cos S x (tt/4) 



Uj + (I-U x Uj) 



h^/a-[xv x u x ] 

Similarly, it is straightforward to calculate 

and T ZjAz (t)\ t=n/4 



Tr,A v («)| t=7r/4 



1 -1/V2 
1 
1/V2 



1 1/V2 0' 
1/V2 
1 



1 -1/V2 
1/V2 

1 



Parallel transporting one tangent we getT x ,A x {n/4:)A x = (-1/V2, -1/V2, 0) T . For 
all tangents in A and V$ we get 




-1/V2 




'-1/V2 




"-1/^" 


\ 


-1/V2 









1/V2 









. 1/V2. 









18/^2" 




'-72/V2 




"-18/-/ 


2" 


18/V2 




-27 




18/V2 


63 




72/V2 
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T/ie above are calculations in global coordinates. In local coordinates we parallel 
transport the basis matrices Xj_, Yj_ and Zj_ so that the local coordinate representation 
of a tangent is the same as in the previous point. The computations are given by 
equation (5.3) and in this example we get 

Y ± (n/A) = 



*x(tt/4) 



1/V2 

1/-/2 
1 



-1/V2 


1/V2 



-1/-/2 0' 

1/V2 

1 



i.e. the second and third columns of each transport matrix due to the specific choice 
of X±, Y± and Z±. 
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Fig. 10.1. Pictorial depiction of the main algorithmic procedure for the Grassmannian Gr(3, 1), 
which is simply the sphere S 2 . 



Taking a step of size t = 7r/4 from X, Y and Z along the specified geodesic we 
arrive at 



X(tt/4) = 



1/V2" 

-1/V2 




F(7r/4) = 



1/V2 


1/V2 



Z(n/4) = 



1/V2] 
1/V2 




The value of the objective function at the starting point is <&(X, Y, Z) = 40.5 and at 
the new point is $ (A(-7r/4), Y(tt/4), X(tt/4)) « 45.5625, an increment as expected. 

Figure |10.1| illustrates the procedures involved in the algorithms on the Grass- 
mannian Gr(3, 1), which we may regard as the 2-sphere S 2 (unit sphere in M 3 ). For 
the best rank-1 tensor approximation of a 3 x 3 x 3 tensor, the optimization takes 
place on a product of three spheres S 2 x S 2 x S 2 , one for each vector that needs to 
be determined. The procedure starts at a point X\ and a direction of ascent]^] the 
tangent Ai G Tjj , is obtained through some method. Next we perform a movement 
of the point X\ along the geodesic defined by Ai. Geodesies on spheres are just great 
circles. At the new point Xi G Gr(3, 1) we repeat the procedure, i.e. determine a new 
direction of ascent A 2 and take a geodesic step in this direction. 

11. Numerical experiments and computational complexity. All algo- 
rithms described here and the object oriented Grassmann classes required for them 
are available for download as two matlab packages [SO] and [ST]. We encourage our 
readers to try them out. 

11.1. Initialization and stopping condition. We will now test the actual 
performance of our algorithms with a few large numerical examples. All algorithms 
in a given test are started with the same initial points on a Grassmannian, repre- 
sented as truncated singular matrices from the hosvd and a number of additional 
higher order orthogonal iterations — HOOl iterations |15[ 1 1 6] , which are introduced to 
make the initial Hessian of $ negative definite. The number of initial HOOl iterations 
ranges between 5 and 50 depending on the size of the problem. The BFGS algorithm is 
either started with (possibly a modification of) the exact Hessian or a scaled identity 
matrix according to [Ml pp. 143]. The L-BFGS algorithm is always started with a 
scaled identity matrix but one can modify the number of columns m in the matrices 



representing the Hessian approximation, see equation (7.2). This number is between 



6 Recall that we are maximizing <E>, therefore 'ascent' as opposed to 'descent'. 
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Fig. 11.1. Left: A 20 X 20 X 20 tensor is approximated by a rank-(5, 5, 5) tensor. BFGS initiated 
with the exact Hessian; In L-BFGS m = 5. Right: A 100 X 100 X 100 tensor is approximated by a 
rank-(5, 10, 20) tensor. In this case the initial Hessian is a scaled identity and m = 10. 



5 and 30. Although we use the HOSVD to initialize our algorithms, any other reason- 
able initialization procedure would work as long as the initial Hessian approximate is 
negative definite. The quasi-Newton methods can be used as stand-alone algorithms 
for solving the tensor approximation problem as well as other problems defined on 
Grassmannians. 

In the following figures, the y-axis measures the norm of the relative gradient, 
i.e. ||V ( I ) (X)||/||$(X)||, and the x-axis shows iterations. This ratio is also used as 
our stopping condition, which typically requires that |j V$(A)||/||<I>(A)| ps 10~ 13 , the 
machine precision of our computer. At a true local maximizer the gradient of the 
objective function is zero and its Hessian is negative definite. In the various figures 
we present convergence results for four principally different algorithms. These are 
(1) quasi-Newton-Grassmann with BFGS, (2) quasi-Newton-Grassmann with l-bfgs, 
(3) Newton-Grassmann, denoted with NG and (4) HOOl which is an alternating least 
squares approach. In addition, the tags for BFGS methods may be accompanied by I 
or H indicating whether the initial Hessian was a scaled identity matrix or the exact 
Hessian, respectively. 

11.2. Experimental results. We run all our numerical experiments in MATLAB 
on a MacBook with a 2.4-GHz Intel Core 2 Duo processor and 4 GB of physical 
memory. 



Figure 11.1 shows convergence results for two tests with tensors generated with 
N(0, l)-distributed values. In the left plot a 20 x 20 x 20 tensor is approximated with 
a rank-(5, 5, 5) tensor. One can observe superlinear convergence in the BFGS method. 
The right plot shows convergence results of a 100 x 100 x 100 tensor approximated with 
a rank-(5, 10, 20) tensor. Both BFGS and l-bfgs methods exhibit rapid convergence 
in the vicinity of a stationary point. 



Figure 11.2 (left) shows convergence for an even larger 200 x 200 x 200 tensor 
approximated by a tensor of rank-(10, 10, 10) using l-bfgs with m = 20. In the right 
plot we approximate a 50 x 50 x 50 tensor by a rank-(20, 20, 20) tensor where we vary 
over a range of values of m in the L-BFGS algorithm, namely, m = 5, 10, 15, 20, 25, 30. 
m = 5 gives (in general) slightly poorer performance, otherwise the different runs 
cannot be distinguished. In other words, our Grassmann l-bfgs algorithm can in 
practice work as well as our Grassmann BFGS algorithm, just as one would expect 
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Fig. 11.2. Left: Convergence plots of a 200 X 200 X 200 tensor approximated by a rank-(10, 10, 10) 
tensor. Right: Effect of varying m in L-BFGS. A 50 X 50 X 50 tensor approximated by a rank- 
(20, 20, 20) tensor with m = 5, 10, 15, 20, 25, 30. 




50 100 150 200 100 200 300 400 500 

ITERATION # ITERATION # 

Fig. 11.3. Left: A 50 X 50 X 50 symmetric tensor is approximated by a rank-5 symmetric 
tensor; m = 10. Right: Here we have a 100 X 100 X 100 symmetric tensor approximated by a rank-20 
symmetric tensor; m = 10. 

(from the numerical experiments performed) in the Euclidean case. 

Figure |11.3| shows convergence plots for two symmetric tensor approximation 
problems. In the left plot we approximate a symmetric 50 x 50 x 50 tensor by a 
rank-5 symmetric tensor. We observe that BFGS initialized with the exact Hessian 
(bfgS:h tag) converges much more rapidly, almost as fast as the Newton-Grassmann 
method, than when initialized with a scaled identity matrix (BFGS:i tag). In the right 
plot we give convergence results for a 100 x 100 x 100 symmetric tensor approximated 
by a rank-20 symmetric tensor. In both cases m = 10. 

In Figure |11.4| we show the performance of a local coordinate implementation of 
the BFGS algorithm on problems with 4-tensors. The first plot shows convergence 
results for a 50 x 50 x 50 x 50 tensor approximated by a rank-(5, 5, 5, 5) tensor. The 
second convergence plot is for a symmetric 4-tensor with the same dimensions ap- 
proximated by a symmetric rank-5 tensor. Again the H and I tags indicate whether 
the exact Hessian or a scaled identity is used for initialization. 

We end this section with two unusual examples to illustrate the extent of our 
algorithms' applicability: a high order tensor and an objective function that includes 
tensors of different orders. The left plot in Figure |11.5| is a high-order example: it 
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Fig. 11.4. Left: A 50 X 50 X 50 X 50 tensor is approximated by a rank-(5, 5, 5, 5) tensor. Right: 
Here we have a 50 X 50 X 50 X 50 symmetric tensor approximated by a rank-5 symmetric tensor. 
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Fig. 11.5. Left: A 5 X 5 X • ■ ■ X 5 tensor of order-10 is approximated with a rank-(2, 2, ... ,2). 
Right: A 'simultaneous' rank-5 approximation of a weighted sum of tensors of orders 2, 3, and 4. 



shows the convergence of BFGS verses HOOl when approximating an order-10 tensor 
with dimensions 5 x 5 x • • ■ x 5 with a rank-(2, 2, ... ,2) tensor. The right plot in 
Figure |11.5| has an unusual objective function that involves an order-2, an order-3, 
and an order-4 tensor, 

*(X) = ^\\S2 ■ (X,X)\\% + i||5 3 • (X,X,X)\\% + i||5 4 • (X,X,X,X)\\ 2 F 

where S2 is a 30 x 30 symmetric matrix, £3 is a 30 x 30 x 30 symmetric 3-tensor, 
and ^4 is a 30 x 30 x 30 x 30 symmetric 4-tensor. Such objective functions have 
appeared in independent component analysis with soft whitening |18j and in principal 
cumulants components analysis |44[ 14 1] where S%, Ss,Si measure the multivariate 
variance, kurtosis, skewness respectively (cf. Example 2.2). In both examples we 
observe a fast rate of convergence at the vicinity of a local minimizer for the BFGS 
algorithm. 

It is evident from the convergence plots here that BFGS and L-BFGS have faster 
rate of convergence compared with HOOl. The Newton-Grassmann algorithm takes 
few iterations but is computationally more expensive, specifically for larger problems. 
Our implementation of the different algorithms in MATLAB give shortest runtime for 
the BFGS and L-BFGS methods. The time for one iteration of BFGS, l-bfgs and HOOl is 
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Algorithm 1 Algorithmic framework for BFGS and l-bfgs on Grassmannians. 



Given tensor A and starting points (Xq, Y ,Z ) £ Gr 3 and an initial Hessian H 
repeat 

1 Compute the Grassmann gradient. 

2 Parallel transport the Hessian approximation to the new point. 

3 Update the Hessian or its compact representation. 

4 Solve the quasi-Newton equations to obtain A = (A x , A y , A z ). 

5 Move the points (Xk, Yfe, Zk) along the geodesic curve given by A. 
until ||V$||/$ < TOL 



of the same magnitude for smaller problems. In larger problems, the l-bfgs performs 
much faster than all other methods. 

Our algorithms use the basic arithmetic and data types in the Tensor Toolbox [3] 
for convenience. We use our own object-oriented routines for operations on Grassman- 
nians and product of Grassmannians, e.g. geodesic movements and parallel transports 
[51] . We note that there are several different ways to implement BFGS updates [46] ; for 
simplicity reasons, we have chosen to update the inverse of the Hessian approximation. 
A possibly better alternative will be to update the Cholesky factors of the approxi- 
mate Hessians so that one may monitor the approximate Hessians for indefmiteness 
during the iterations [251 EM I?]. 

11.3. Computational complexity, curse of dimensionality, and conver- 
gence. The Grassmann quasi-Newton methods presented in this report all fit within 
the procedural framework given in Algorithm [l] 

General case. In analyzing computational complexity, we will assume for simplic- 
ity that A is a general n x n x n 3-tensor being approximated with a rank-(r, r, r) 
3-tensor. A problem of these dimensions will give rise to a 3nr x 3nr Hessian matrix 
in global coordinates and a 3(n — r)r x 3(n — r)r Hessian matrix in local coordinates. 
Table [TlTT] gives approximately the amount of computations required in each step of 
Algorithm [T] Recall that in L-BFGS to is a small number, see Section [7] We have 





BFGS-GC 


BFGS-LC 


L-BFGS 


1 


6n 3 r + 12n 2 r 2 


6n 3 r + 6n 2 r 2 + 6n(n — r)r 3 


6n 3 r + 12n 2 r 2 


2 


18n 3 r 2 




12n 2 rm 


3 


36n 2 r 2 


36(n -r) 2 r 2 




4 


18n 2 r 2 


18(n -r) 2 r 2 


24nrTO 



Table 11.1 



Computational complexity of the BFGS-GC (global coordinates), BFGS-LC (local coordinates) and 
L-BFGS algorithms. The numbers in the first column correspond to the steps in Algorithm\l\ 



omitted terms of lower asymptotic complexity as well as the cost of point 5 since 
that is negligible compared with the costs of points 1-4. For example, the geodesic 
movement of Xk requires the thin SVD U x T, x Vj — A K €E E" xr which takes 6nr 2 + 20r 3 
flops (floating point operations) ■ On the other hand, given the step length t and 
U, S, V in (5.1 1, the actual computation of X(t) amounts to only 4nr 2 flops. 

Symmetric case. The symmetric tensor approximation problem involves the de- 
termination of one n x r matrix, resulting in an nr x nr Hessian in global coordinates 
and an (n — r)r x (n — r)r Hessian in local coordinates. Therefore the complexity of 
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the symmetric problem differs only by a constant factor from that of the general case. 

Curse of dimensionality. The approximation problem will suffer from the curse 
of dimensionality when the order of a tensor increases. In general, an n x • • • x n 
order-fc tensor requires the storage of n k entries in memory. The additional mem- 
ory requirement, mainly for storing the Hessian, is of order 0(n 2 r 2 k 2 ) for the BFGS 
methods and 0(2nrkm) for the L-BFGS method, respectively. In the current approach 
we assume that the tensor is explicitly given. Our proposed algorithms are applica- 
ble as long as the given tensor fits in memory. There have been various proposals 
to deal with the curse of dimensionality using tensors {35j |36j [48] . For cases where 
the tensor is represented in compact or functional forms our methods can take direct 
advantage of these simply by computing the necessary gradients (and Hessians) using 
the specific representations. In fact this was considered in |43) for symmetric tensor 
approximations. 

Convergence. There is empirical evidence suggesting that ALS based algorithms 
have fast convergence rate for specific tensors. This was also pointed out in \V2\ . 
These are tensors that have inherently low multilinear rank and the approximating 
tensor has the correct low ranks, or tensors that have fast decay in its multilinear 
singular values |15j . or a substantial gap in the multilinear singular values at the site 
of truncation, e.g. the source tensor is given by a low rank tensor with noise added. On 
the other hand not all tensors have gaps or fast decaying multilinear singular vales. 
This is specifically true for sparse tensors. It is still desirable to obtain low rank 
approximations for these "more difficult" tensors. And on these tensors ALS performs 
very poorly, but methods using first and second order derivatives of the objective 
function, including the methods presented in this paper perform good. Among the 
methods that are currently available, quasi-Newton methods presented in this paper 
have the best computational efficiency. 

12. Related work. There are several different approaches to solve the tensor 
approximation problem. In this section we will briefly describe them and point out 
the main differences with our work. The algorithms most closely related to the quasi- 
Newton methods are given in [SS] [3H [33J . All three references address the best low 
rank tensor approximation based on the Grassmannian structure of the problem and 
use explicit computation of the Hessian. The obtained Newton equations are solved 
either fully [25l [34] or approximately [33] . In the latter case the authors used a 
truncated conjugate gradient approach to approximately solve the Newton equations. 
The iterates are updated using the more general notion of retractions instead of taking 
a step along the geodesic on the manifold. In addition a trust region scheme is 
incorporated making the procedure more stable with respect to occasional indefinite 
Hessians. The computation of the Hessian is a limiting factor in these algorithms. 
This is the case even when the Hessian is not formed explicitly but used implicitly 
via its action on a tangent. In our experiments, on moderate-sized problems, e.g. 3- 
tensors of dimensions around 20 x 20 x 20, the BFGS methods noticeably outperformed 
Hessian-based methods; and for dimensions around 100 x 100 x 100, we were unable 
to get any methods relying on Hessians to work despite our best efforts. 

There is a different line of algorithms for related tensor approximation problems 
based on ALS and mutltigrid accelerated ALS 35, 36 . In our experience, the conver- 
gence of ALS-typc methods depend on the decay of the multilinear singular values 
of the given tensor. The exact dependence is unclear but the relation seems to be 
that the faster the decay, the faster the convergence of ALS. In this regard the class 
of functions and operators considered in [35], 1361 appears to possess these favorable 
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properties. 

Yet a third approach to obtain low multilinear rank tensor approximations are 
the cross methods in jJSJ 2H1 US] • The novelty of such methods is that they discard 
some given information and retain only a fraction of the original tensor, and as such 
it is markedly different from our approach, which uses all given information to achieve 
maximal accuracy. In addition, there is an assumption on the tensor that there exist 
approximations within pre-specified bounds and of specific low ranks while we make 
no such assumptions. 

13. Conclusion. In this paper we studied quasi-Newton algorithms adapted to 
optimization problems on Riemannian manifolds. More specifically, we proposed algo- 
rithms with BFGS and L-BFGS updates on a product of Grassmannians that (1) respect 
the Riemannian metric structure and (2) require only standard matrix operations in 
their implementations. Two different algorithmic implementations are presented: one 
based on local/intrinsic coordinates while the other one uses global/embedded coordi- 
nates. In particular, our use of local coordinates is a novelty not previously explored in 
other manifold optimization 12 2lJ [57] . We proved the optimality of our Grassman- 
nian BFGS updates in local coordinates, showing that the well-known BFGS optimality 
[2"T1 |2"5] extends to Grassmannian and products of Grassmannians. 

We also applied these algorithms to the problem of determining a best multilinear 
rank approximation of a tensor and the analogous (but very different) problem for a 
symmetric tensor. While a Newton version of this was proposed in [25], here we make 
substantial improvements with respect to the Grassmann-Newton algorithm in terms 
of speed and robustness. Furthermore, we presented specialized algorithms that take 
into account the symmetry in the multilinear approximation of symmetric tensors and 
related problems. In addition to the numerical experiments in this paper, we have 
made our codes freely available for download (EH [51] so that the reader may verify 
the speed, accuracy, and robustness of these algorithms for himself. 

Appendix A. Notation for tensor contractions. 

In this section we define the contracted tensor product notation used throughout 
this paper. For given third order tensors A and B we define the following contracted 
products: 

C = (A,B}i, Cijki = },a\ijb\M. (A.l) 

A 

When contracting several indices, with the corresponding indices of the two arguments 
being the same, we write 

C = (A,B)t,2, Ci = y^axvibxyj. (A. 2) 

X,v 

The subscript T' in (A, B)\ and subscripts '1,2' in (A 1 B)\.2 indicate that the con- 
traction is over the first index and both the first and second indices respectively. If 
instead the contraction is to be performed on different indices, we write 

C = (A,B)i-2, c ijk i =*^2a\ijb k xi or C = (A, B)i t3 - 2 ,i, c t j = } j axwb u xj- 

x X,v 

The subscripts indicating the indices to be contracted are separated by a semicolon. 
It is also convenient to introduce a notation when contraction is performed in all but 
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or a few indices. For example the products in (|A.2[) and (|A.2[) may also be written 



(i,B)i,2 = (AB)-3 or (A8)i = (Afi)- M . 
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